This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2927108, IEEE Access
D-TCAM: A High-performance
Distributed RAM based TCAM
Architecture on FPGAs
MUHAMMAD IRFAN1, (Student Member, IEEE), ZAHID ULLAH2, (Member, IEEE), AND RAY C. C. CHEUNG1, (Member, IEEE)
1Department of Electrical Engineering, City University of Hong Kong, Kowloon Tong, Hong Kong (e-mail: m.irfan@my.cityu.edu.hk, racheung@cityu.edu.hk)
2Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Pakistan (e-mail: zahidullah@cecos.edu.pk)
Corresponding author: Muhammad Irfan (e-mail: m.irfan@my.cityu.edu.hk).
This work is supported by City University of Hong Kong Collaborative Research Fund (CRF): 8730043.
ABSTRACT
Ternary content-addressable memory (TCAM) is a high-speed searching device that searches its entire memory in parallel in deterministic time, unlike random-access memory (RAM), which must be searched sequentially. A network router classifies and forwards a data packet with the aid of a TCAM that stores the routing data in a table. FPGAs, with their hardware-like performance and software-like reconfigurability, are widely used in networking systems where TCAM is an essential component. TCAM is not included in modern FPGAs, which leads to the emulation of TCAM using the available resources on the FPGA. Several emulated TCAM designs have been presented, but they lack efficient utilization of the FPGA's hardware resources. In this paper, we present a novel TCAM architecture, distributed RAM based TCAM (D-TCAM), using D-CAM as a building block. One D-CAM block implements a 48-byte TCAM using 64 lookup tables (LUTs); D-CAM blocks are cascaded horizontally and vertically to increase the width and depth of the TCAM, respectively. A sample size of 512×144 is implemented on a Xilinx Virtex-6 FPGA, reducing hardware utilization by 60% compared to the state-of-the-art FPGA-based TCAMs. Similarly, by exploiting the LUT-flip-flop (LUT-FF) pair nature of Xilinx FPGAs, the proposed TCAM architecture improves throughput by 58.8% without any additional hardware cost.
INDEX TERMS Content-addressable memory (CAM), field-programmable gate array (FPGA), lookup
table (LUT), memory, random-access memory (RAM).
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.
those from Altera (FLEX) [10] and Lattice in the form of partial memory blocks and multi-function blocks (MFBs), but TCAM is not included in the latest release of these devices. What if we need a TCAM for the implementation of a networking system inside an FPGA?

Emulated TCAMs are the alternative proposed by many researchers: they use the hardware resources of the FPGA to build a TCAM inside the device [11], [12]. FPGA vendors, including Altera [13] and Xilinx [11], [14], also provide soft IP (intellectual property) cores for TCAM to support its applications, but the memory utilization in their current versions is inefficient. Xilinx recently released an emulated TCAM to support SDNs [11], which likewise lacks efficient usage of the memory resources on the FPGA.

In this work, we propose a novel TCAM architecture using distributed RAM (D-TCAM) which outperforms the state-of-the-art FPGA-based TCAMs by efficiently utilizing the hardware resources. We exploit the physical structure of FPGA slices to improve the throughput of the TCAM by introducing pipeline registers into the design. The improvements take the form of 60% better hardware utilization and 58.8% higher throughput compared to the state-of-the-art FPGA-based TCAMs [11], [15], [16].

The rest of the paper is arranged as follows: Section II provides the motivations and key contributions of the proposed design. Section III discusses the related work. Section IV describes the proposed TCAM architecture. Section V illustrates the technical feasibility of the proposed design and comparisons with state-of-the-art TCAMs. Section VI concludes this work and highlights future directions.

II. MOTIVATIONS AND CONTRIBUTIONS
A. MOTIVATIONS
TCAM is an essential element in modern networking applications, where it provides the functionality of packet classification and packet forwarding by implementing a routing table. FPGAs, having a considerable amount of memory and logic resources, are extremely favorable for complex reconfigurable systems, e.g., software-defined networks (SDNs).

B. KEY CONTRIBUTIONS
Key contributions of the proposed work are as follows:
• The proposed TCAM design, D-TCAM, outperforms the available FPGA-based TCAMs in terms of hardware utilization on FPGA, showing an improvement of 60% over the available state-of-the-art TCAMs [11].
• The available architectures use either lookup tables (LUTs) [15] or flip-flops (FFs) [18], [19] to implement TCAM; the other element of each LUT-FF pair is inferred by the synthesizer but is not part of the design, so LUT-FF pairs are inefficiently utilized in such designs. D-TCAM exploits the paired nature of LUTs and FFs in modern FPGAs and utilizes them coherently, so that neither goes to waste: the otherwise redundant FFs are used to improve the speed of the design.
• D-TCAM provides better performance in terms of the throughput of the TCAM design, showing an improvement of 58.8% compared to the state-of-the-art FPGA-based TCAMs [11], [15].

III. RELATED WORK
Conventional TCAMs based on application-specific integrated circuits (ASICs) have low storage density, are less reconfigurable, and are challenging to integrate with FPGA-based systems [20], [21]. Emulating TCAM using the FPGA's hardware resources is a better option for integration with FPGA-based searching applications. The focus of this work is on emulated TCAMs, a mature research area comprising a large number of different architectures collectively known as FPGA-based TCAMs.

Based on the memory resources utilized to emulate TCAM functionality, FPGA-based TCAMs are classified into three categories.

1) The first is block RAM (BRAM)-based TCAMs [15], [16], [22]–[25], which use the FPGA's BRAMs to store the information of the TCAM data. HP-TCAM [22] divided the TCAM table into horizontal and vertical blocks and stored them separately in the BRAMs of the FPGA. Implementation on FPGA resources was not possible without this division, because the number of BRAMs required for the architecture presented in patent [26]
increases exponentially with the width of the TCAM. The information about the presence of a rule and its address is stored in different BRAMs, which inefficiently utilizes BRAMs even to emulate a small TCAM (512×36). Z-TCAM [27] and E-TCAM [28] are slightly improved versions of HP-TCAM, but they also suffer from inefficient memory utilization. The major drawback of HP-TCAM, Z-TCAM, and E-TCAM is the requirement that TCAM rules be in ascending or descending order; otherwise, the architecture fails to provide the match address. UE-TCAM [23] reduced the hardware utilization by storing the information about the presence of a TCAM rule and its address in the same BRAM, and its TCAM rules need not be arranged in ascending or descending order.

REST [24] proposed a hardware-efficient architecture that accesses the BRAM multiple times, which affects the throughput of the TCAM. Jiang [15] presented a scalable TCAM architecture in a sophisticated manner and utilized a large number of BRAMs to implement a 1024×150 TCAM; however, it is not entirely clear from the description in the paper how such a large size is achieved. Our proposed architecture utilizes the LUTRAM resources on FPGA according to their structure (LUT-FF pairs), which improves the throughput as well as the hardware utilization. The multi-pumping approach [29] accesses the BRAM memory multiple times in one system clock cycle to reduce hardware utilization, but the speed drops with the multiple accesses. Our proposed design does not compromise on the speed of the TCAM architecture and improves both the hardware utilization and the throughput of the emulated TCAM design.

2) The second type of FPGA-based TCAM is emulation using the distributed-RAM resources on FPGAs [11], [14]–[16]. Most of the previous works initially designed with BRAMs have also been implemented using distributed RAM. In 2012, Xilinx presented a TCAM architecture in its application note using SRL16E, which uses a 4-input LUTRAM as a shift register [14]. One SRL16E implements 2 bits of TCAM; if the same approach is applied to a 6-input LUTRAM, only a 4-bit TCAM can be emulated. Our proposed TCAM architecture utilizes a 6-input LUTRAM to emulate a 6-bit TCAM and combines 64 LUTRAMs to emulate 48 bytes of TCAM. Jiang [15] also proposed a distributed RAM-based TCAM architecture which uses a huge number of slices: 20526 slices to implement a 1024×150 TCAM on a Xilinx Virtex-7 FPGA, while our proposed design uses 4835 slices to implement a 512×144 TCAM on a Xilinx Virtex-6 FPGA. The throughput of Jiang's TCAM design and our proposed D-TCAM are 20.9 and 37.3 Gb/s, respectively.

In 2017, Xilinx released a soft IP core for SDNet which uses 12011 slices and 3 BRAMs (36 Kb) to implement a 512×128 TCAM [11]. The architecture's internal design is not disclosed in the IP data sheet; only implementation results are mentioned. Our proposed TCAM architecture improves on the Xilinx design in slice utilization as well as throughput, by 60% and 58.8%, respectively. In DURE [30], one slice is utilized to implement a 1×18 block of TCAM, and carry-chain logic rolls up the AND logic of the match lines across the subsequent slices. It uses 4 LUTRAMs to emulate 18 bits of TCAM, while our proposed TCAM architecture uses 4 LUTRAMs to emulate 24 bits of TCAM. Our proposed design provides a building block of 64×6 TCAM which can be expanded horizontally and vertically to implement larger TCAMs.

Distributed RAM-based TCAM architectures waste a large number of flip-flops in FPGAs: because only the LUT of each LUT-FF pair is needed to implement the design, the paired FFs go unused, which we call inefficient utilization of memory resources. We have designed the proposed architecture keeping in mind the hardware structure of FPGA slices and their other components, as explained in the upcoming sections.

3) The third type of FPGA-based TCAMs in the literature is register-based TCAMs [18], [19], [31], [32], which use FFs to store the TCAM rules and compare each flip-flop with the input search key in parallel. They are also known as gate-based TCAMs; FFs in FPGA terminology are known as slice registers (SRs). Gate-based TCAMs reduce the hardware resources in terms of slices per TCAM bit but lack scalability because of their high routing complexity. LH-CAM [18] and G-AETCAM [19] are implemented as 64×36 BiCAM and TCAM, respectively, on Xilinx FPGAs. They showed improvements in hardware resources and speed for small TCAMs, but for larger sizes they are not a good option for emulating TCAM on FPGA.

Some other sub-types include hash-based TCAMs [7], algorithmic TCAMs, pseudo-TCAM [2], and cuckoo hashing [33], which have non-deterministic search latency and suffer from collisions as well as bucket overflow. Algorithmic TCAMs perform best for a specific set of rules, as they rely on a specific pattern in the input search key [34]. Updating procedures [35], [36] and soft-error detection [9] are other research directions in FPGA-based TCAMs. The scope of this paper is to propose a novel TCAM design which outperforms the state-of-the-art TCAM architectures in hardware utilization and throughput.

Our proposed architecture is based on the RAM of the FPGA, so only type 1 and type 2 FPGA-based TCAMs are part of our comparisons, as both use the RAM available in FPGAs. To make a fair comparison between BRAMs and distributed RAM, we computed the number of normalized slices for all of the TCAM architectures, as discussed in Section V.

IV. PROPOSED ARCHITECTURE
A. TERMINOLOGY
Table 1 lists the important terminologies used in the proposed design.

B. FPGA STRUCTURE
Modern FPGAs consist of reconfigurable hardware components, to implement any digital system, supported by the
interconnection matrix. The main parts of Xilinx FPGAs are block RAMs (BRAMs), digital signal processing (DSP) blocks, configurable logic blocks (CLBs), and input/output (I/O) blocks (IOBs), as shown in Fig. 2. A CLB consists of two types of slices: SLICEM and SLICEL. Each slice has four lookup tables (LUTs) and eight flip-flops (also known as SRs). LUTs inside a SLICEL can be configured as logic elements, each implementing any function of up to 6 inputs (using one LUT), and are known as function generators. LUTs inside a SLICEM can be configured as logic elements as well as memory elements, and are known as LUTRAMs. Each LUTRAM is a column memory of 64 cells which can be used as a 64-bit RAM. Multiple LUTRAMs combine to form a distributed RAM.

FIGURE 2: Internal structure of an FPGA. (BRAM: block RAM, DSP block: digital signal processing block, CLB: configurable logic block, IOB: input/output block).

FIGURE 3: Converting LUTRAMs into TCAM. (a, b) One 3-input LUTRAM emulating a 1×3 TCAM. (c) Two 3-input LUTRAMs emulating a 2×3 TCAM. (d) Two 3-input LUTRAMs emulating a 1×6 TCAM with a 2-input AND gate. (e) Four 3-input LUTRAMs emulating a 2×6 TCAM with two 2-input AND gates.

C. LUTRAM AS TCAM
A 3-input LUTRAM saves the information of a 1×3 TCAM. The address input of the LUTRAM is treated as the search key (Sk) of the 1×3 TCAM. Only those location(s) whose address corresponds to a rule present in the TCAM store 'logic-1'; otherwise, 'logic-0' is stored. For instance, a 3-input LUTRAM saving the TCAM rule "0 1 1" holds 'logic-1' in the 3rd location of the LUTRAM and 'logic-0' everywhere else, as shown in Fig. 3(a). If the TCAM rule has 'X' bit(s), multiple locations of the LUTRAM are 'logic-1'. For instance, a 3-input LUTRAM saving the TCAM rule "0 X 1" holds 'logic-1' in the 1st and 3rd locations of the LUTRAM, as shown in Fig. 3(b).

Two 3-input LUTRAMs are cascaded to form a 2×3 TCAM; the LUTRAMs in Fig. 3(c) store "1 0 X" and "X 0 1" as two 3-bit TCAM rules. Similarly, two 3-input LUTRAMs are connected in parallel with their outputs ANDed to form a 1×6 TCAM; the LUTRAMs in Fig. 3(d) store "1 0 X 0 1 0" as one
6-input rule. An input search key (Sk) equal to "1 0 0 0 1 0" or "1 0 1 0 1 0" matches the stored rule and provides a match line (ML) equal to 'logic-1' as output.

Four 3-input LUTRAMs are connected, two in parallel and two in cascade, to emulate a 2×6 TCAM. In Fig. 3(e), two TCAM rules of 6 bits each ("1 0 0 X 1 0" and "1 X X 0 0 1") are stored.

To simplify the understanding, we explained the emulation of TCAM using 3-input LUTRAMs. Modern FPGAs mostly have 6-input LUTRAMs, as in Xilinx FPGAs. In the proposed TCAM design we use 6-input LUTRAMs, each of which saves the information of a 1×6 TCAM in a similar way as described for the 3-input LUTRAM in Fig. 3(a). A D-CAM block has 64 6-input LUTRAMs, which implement a 64×6 TCAM. Multiple D-CAM blocks are connected to form larger TCAMs, in cascade and parallel form, as described for multiple 3-input LUTRAMs in Fig. 3(e).

FIGURE 4: A building block of D-TCAM. (Sk: search key, MLs: match lines, PE: priority encoder). (a) SRAM memory for one D-CAM block. (b) 64×6 TCAM emulated using a 64×64 SRAM memory made of 64 LUTRAMs and a priority encoder.

D. ONE D-CAM BLOCK
The D-CAM block is the basic building block of the proposed architecture, D-TCAM. One D-CAM block consists of 64 6-input LUTs. These LUTs come from the SLICEM slices of Xilinx FPGAs and are also known as LUTRAMs. Fig. 4(a) shows a D-CAM block, which is a 64×64 distributed memory consisting of 64 LUTRAMs. All of the memory locations can be read and written using Address and Data_in as inputs, with Data_out as the output of the whole memory block (D-CAM). The same memory of Fig. 4(a) is used in Fig. 4(b) with a priority encoder, changing it into a TCAM memory.

Address in Fig. 4(a) represents the search key (Sk) of the TCAM memory in Fig. 4(b), and Data_out in Fig. 4(a) represents its match lines (MLs). The final output of a TCAM is always the address where the search key is found. A priority encoder (PE) is normally used to translate the match lines into an address, but the PE has different implementation designs of its own and is not part of the proposed design; Fig. 4(b) shows a PE only to ease the understanding of a typical TCAM data flow.

E. MODULAR STRUCTURE OF D-TCAM
Each D-CAM block implements 48 bytes of TCAM with configuration 64×6, where 64 is the depth (D) and 6 is the width (W) of the TCAM. Increasing the width of the TCAM from 6 bits to 12 bits requires an additional D-CAM block while keeping the depth of the TCAM constant, as shown in Fig. 5(a). Similarly, increasing the depth of the TCAM from 64 to 128 requires an additional D-CAM block while keeping the width constant, as shown in Fig. 5(b). Horizontal expansion of D-CAM blocks increases the width, while vertical expansion increases the depth of the emulated D-TCAM. Expansion in both directions creates the desired size configuration of TCAM on FPGAs. Fig. 6 shows the formation of a 128×12 D-TCAM from four D-CAM blocks.
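The behaviour described above (LUTRAM contents derived from ternary rules, per-rule match lines ANDed across 6-bit slices of the key, and a priority encoder) can be sketched as a small software model. This is our own illustrative Python sketch, not the hardware design; all function names are ours:

```python
# Software model of the D-CAM/D-TCAM search scheme (illustrative only).
# A 6-input LUTRAM is modelled as a 64-entry bit list: a ternary rule
# ('0', '1', 'X') sets every address that matches it to logic-1.
from itertools import product

LUT_INPUTS = 6  # modern Xilinx FPGAs have 6-input LUTRAMs

def lutram_from_rule(rule6):
    """Build the 64-bit LUTRAM contents for one 6-bit ternary rule."""
    assert len(rule6) == LUT_INPUTS
    bits = [0] * (1 << LUT_INPUTS)
    for key in product('01', repeat=LUT_INPUTS):
        # 'X' matches either key bit; '0'/'1' must match exactly
        if all(r in ('X', k) for r, k in zip(rule6, key)):
            bits[int(''.join(key), 2)] = 1
    return bits

def dtcam_search(table, key):
    """Search a D-TCAM: 6-bit slices of the key address the LUTRAMs of
    one row; the per-slice outputs are ANDed into one match line per
    stored rule, and a priority encoder picks the lowest match."""
    n = len(table[0]) // LUT_INPUTS                  # blocks per row
    slices = [key[i*6:(i+1)*6] for i in range(n)]    # split search key
    mls = []
    for rule in table:                               # one ML per rule
        parts = [rule[i*6:(i+1)*6] for i in range(n)]
        mls.append(all(lutram_from_rule(p)[int(s, 2)]
                       for p, s in zip(parts, slices)))
    # priority encoder: first (lowest) matching address, else None
    return next((i for i, m in enumerate(mls) if m), None)

table = ['10X010' + 'XXXXXX',   # rule 0: 12-bit rule with don't-cares
         '011011' + '000000']   # rule 1
print(dtcam_search(table, '101010' + '111111'))  # matches rule 0 -> 0
```

In the hardware, each column of `lutram_from_rule` output is one 64-cell LUTRAM and the `all(...)` reduction is the AND gate (or carry chain) over horizontally cascaded D-CAM blocks; the model only mirrors that data flow.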
FIGURE 6: Formation of a 128×12 D-TCAM from cascaded/parallel D-CAM blocks, with the match lines of horizontally cascaded blocks ANDed and registered in FFs.
Algorithm 1 Populating Algorithm (D×W TCAM)
Input: TCAM table (D×W)
Output: D-TCAM table (m·n sub-tables)
1: for t ← 0 to m·n do
2:     for i ← 0 to m do
3:         for j ← 0 to n do
4:             D-CAM[i][j] ⇐ TCAM sub-table[t]
5:         end for
6:     end for
7: end for
m × n: total number of sub-tables in the TCAM, which is equal to the number of D-CAM blocks in the D-TCAM.
m: number of sub-tables in a column.
n: number of sub-tables in a row.

TABLE 2: Speed (in MHz) of D-TCAM for different sizes of TCAM. (W: width of TCAM, D: depth of TCAM. The left value is the speed of D-TCAM without pipeline registers; the right value is the speed of D-TCAM with pipeline registers.)

D \ W      18          36          72          144
64         513 — 793   459 — 717   368 — 526   353 — 492
128        425 — 685   459 — 701   345 — 459   258 — 332
256        369 — 450   376 — 510   238 — 351   276 — 370
512        362 — 444   326 — 460   188 — 214   207 — 259

TABLE 3: Number of slices used by D-TCAM for different sizes of TCAM. (W: width of TCAM, D: depth of TCAM. The left value is the slices used by D-TCAM without pipeline registers; the right value is the slices used by D-TCAM with pipeline registers.)

D \ W      18          36          72           144
64         76 — 68     123 — 119   280 — 288    556 — 562
128        128 — 128   254 — 254   587 — 583    1078 — 1067
256        271 — 258   458 — 462   1205 — 1168  2435 — 2395
512        538 — 536   930 — 968   2387 — 2357  4842 — 4835

If we apply Algorithm 1 to Fig. 6, t is 4. Theoretically, a total of m·n·64 LUTRAMs is required to implement a D×W D-TCAM because each D-CAM block consists of 64 LUTRAMs.

The update latency of RAM-based TCAMs is proportional to the depth of the utilized RAM blocks. BRAM-based TCAMs have a 513-clock-cycle update latency, while distributed-RAM based TCAMs have an update latency of 65 clock cycles. Our proposed design is based on LUTRAMs and can correspondingly be updated in 65 clock cycles, like a typical distributed RAM-based TCAM.
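Algorithm 1 amounts to slicing the rule table into 64×6 tiles, one per D-CAM block. A minimal Python sketch of this partitioning (our own illustrative code; the names follow the algorithm's m, n, and D-CAM[i][j]):

```python
# Partition a D x W TCAM table into m*n sub-tables of 64 rules x 6 bits,
# one per D-CAM block, mirroring Algorithm 1 (illustrative sketch).

def populate_dtcam(tcam_table):
    D, W = len(tcam_table), len(tcam_table[0])
    assert D % 64 == 0 and W % 6 == 0, "pad the table to 64x6 multiples"
    m, n = D // 64, W // 6          # sub-tables per column / per row
    dcam = [[None] * n for _ in range(m)]
    for i in range(m):              # vertical position of the block
        for j in range(n):          # horizontal position of the block
            dcam[i][j] = [row[j*6:(j+1)*6]
                          for row in tcam_table[i*64:(i+1)*64]]
    return dcam                     # dcam[i][j] configures one D-CAM block

# 128x12 TCAM (as in Fig. 6) -> m = 2, n = 2, i.e. t = 4 D-CAM blocks
table = ['0' * 12 for _ in range(128)]
blocks = populate_dtcam(table)
print(len(blocks) * len(blocks[0]))  # -> 4
```

Note that writing one rule into a block means rewriting a full 64-deep LUTRAM column, which is consistent with the 64 + 1 = 65-cycle update latency stated above (and 512 + 1 = 513 cycles for a 512-deep BRAM).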
FIGURE 8: The effect of using pipeline registers. (a) Number of slices used vs. TCAM depth without pipelining. (b) Number of slices used vs. TCAM depth with pipelining. (c) Speed of the design vs. TCAM depth without pipelining. (d) Speed of the design vs. TCAM depth with pipelining. The four lines in each graph correspond to different TCAM widths (18, 36, 72, and 144 bits).
For instance, HP-TCAM is implemented using BRAM, and its utilization is published in the form of BRAMs, so we converted the BRAMs to normalized slices using (1), which is provided in the comparison table as the number of slices (Slices′). Most of the TCAM designs under comparison are implemented using Virtex-6; we also implemented D-TCAM on Virtex-6 to make the comparisons fair. Some of the architectures are implemented on different devices which use a different technology, e.g., Virtex-7 is used by Jiang [15] and Xilinx [11], while Kintex-7 is used by REST [24]. Virtex-6 is a 40 nm device while Virtex-7 and Kintex-7 are 28 nm devices, which also differ in speed. For the comparisons, the normalized speed is calculated using (2), where Speed′ is the normalized speed on the target device and Vdd is always equal to 1.0 V.

Speed′ = Speed × [Device (nm) / Virtex-6 (nm)]    (2)

The throughput (in Gbit/s) of the proposed design is found using (3), where Speed is the maximum clock rate of the emulated TCAM and W is the width of the emulated TCAM.

Throughput = Speed × W    (3)

C. IMPACT OF PIPELINING
The TCAM architectures that use LUTRAMs for emulating TCAM do not use the FFs of the LUT-FF pairs in the FPGA's slices. We have used those redundant FFs as pipeline registers to improve the speed of the proposed TCAM architecture. Pipeline registers are FFs which would ordinarily require additional slices in a design, but in our case almost no additional slices are used while the speed improvement is prominent. To visualize this more clearly, we have plotted four graphs, Fig. 8(a)-(d), showing the change in slice usage and the change in speed of the proposed design with and without pipeline registers.
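As a quick arithmetic check of (2) and (3) using figures reported in this paper (the helper names are ours, for illustration only):

```python
# Normalized speed, eq. (2), and throughput, eq. (3), with paper values.

def normalized_speed(speed_mhz, device_nm, virtex6_nm=40):
    # Eq. (2): scale a result on another device by the technology ratio
    return speed_mhz * device_nm / virtex6_nm

def throughput_gbps(speed_mhz, width_bits):
    # Eq. (3): Throughput = Speed * W (MHz * bits -> Gbit/s)
    return speed_mhz * width_bits / 1000.0

# 512x144 D-TCAM with pipeline registers runs at 259 MHz (Table 2)
print(round(throughput_gbps(259, 144), 1))  # -> 37.3 Gb/s, as reported

# A result on a 28 nm device (Virtex-7/Kintex-7) normalized to 40 nm
print(normalized_speed(200, 28))            # -> 140.0 MHz
```

The first line reproduces the 37.3 Gb/s figure quoted for D-TCAM; the second shows how a 28 nm result is scaled down for a fair comparison against the 40 nm Virtex-6.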
The horizontal dotted lines in Fig. 8(a) and Fig. 8(b) show that the increase in slices is almost negligible, while those in Fig. 8(c) and Fig. 8(d) show that the increase in speed is prominent. The speed of every emulated D-TCAM design with pipeline registers is greater than the speed of the corresponding design without them. Our proposed architecture thus improves the speed of the TCAM without using additional hardware resources on the FPGA. The ideal case of this implementation can be seen in Tables 2 and 3 for the D-TCAM of size 128×36: the slices used in both cases (with and without pipeline registers) are the same (254), but the speed is improved by 52% (from 459 to 701 MHz).

Table 4 summarizes the state-of-the-art FPGA-based TCAM architectures and shows that our proposed architecture improves the hardware resources by 60% in terms of slices on FPGA. The throughput is improved from 29.8 Gb/s to 37.3 Gb/s by using the redundant FFs of the LUT-FF pairs as pipeline registers. The state-of-the-art TCAM design [11] has a throughput of 15.36 Gb/s, while our proposed architecture significantly improves it to 37.3 Gb/s, a 58.8% improvement, because of the novel TCAM design using LUT-FF pairs.

VI. CONCLUSIONS AND FUTURE WORK
The proposed TCAM design efficiently utilizes the hardware resources on FPGA and exploits the internal structure of the slices inside the FPGA. Pipelining improved the performance of the TCAM, in terms of throughput, without extra hardware cost compared to the state-of-the-art FPGA-based TCAMs. Future work targets further optimization of emulated TCAM for networking applications to utilize the hardware resources on FPGAs even more efficiently.

REFERENCES
[1] R. Karam, R. Puri, S. Ghosh, and S. Bhunia, "Emerging trends in design and applications of memory-based computing and content-addressable memories," Proceedings of the IEEE, vol. 103, no. 8, pp. 1311–1330, 2015.
[2] W. Yu, S. Sivakumar, and D. Pao, "Pseudo-TCAM: SRAM-based architecture for packet classification in one memory access," IEEE Networking Letters, 2019.
[3] X. Chen, H. Huo, J. Huan, and J. S. Vitter, "Efficient graph similarity search in external memory," IEEE Access, vol. 5, pp. 4551–4560, 2017.
[4] E. I. Guerra-Hernandez, A. Espinal, P. Batres-Mendoza, C. Garcia-Capulin, R. Romero-Troncoso, and H. Rostro-Gonzalez, "A FPGA-based neuromorphic locomotion system for multi-legged robots," IEEE Access, 2017.
[5] S. Mittal, "A survey of techniques for architecting TLBs," Concurrency and Computation: Practice and Experience, vol. 29, no. 10, p. e4061, 2017.
[6] O. Mujahid, Z. Ullah, H. Mahmood, and A. Hafeez, "Fast pattern recognition through an LBP driven CAM on FPGA," IEEE Access, vol. 6, pp. 39525–39531, 2018.
[7] Y. Yu, D. Belazzougui, C. Qian, and Q. Zhang, "Memory-efficient and ultra-fast network lookup and forwarding using Othello hashing," IEEE/ACM Transactions on Networking, 2018.
[8] T.-S. Chen, D.-Y. Lee, T.-T. Liu, and A.-Y. Wu, "Dynamic reconfigurable ternary content addressable memory for OpenFlow-compliant low-power packet processing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 10, pp. 1661–1672, 2016.
[9] A. Ullah, P. Reviriego, A. Sánchez-Macián, and J. A. Maestro, "Multiple cell upset injection in BRAMs for Xilinx FPGAs," IEEE Transactions on Device and Materials Reliability, vol. 18, no. 4, pp. 636–638, 2018.
[10] Altera, "FLEX 10K embedded programmable logic device family data sheet," 2003.
[11] Xilinx, "Ternary Content Addressable Memory (TCAM) Search IP for SDNet," Xilinx Product Guide PG190, San Jose, USA, Nov. 2017.
[12] P. Alfke, "Creative uses of block RAM," White Paper: Virtex and Spartan FPGA Families, Xilinx, 2008.
[13] Altera, "Implementing high speed search applications with Altera CAM," Application Note 119, v2.1, July 2001.
[14] K. Locke, "Parameterizable content-addressable memory," Xilinx Application Note XAPP1151, 2011.
[15] W. Jiang, "Scalable ternary content addressable memory implementation using FPGAs," in Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems. IEEE Press, 2013, pp. 71–82.
[16] Z. Qian and M. Margala, "Low power RAM-based hierarchical CAM on FPGA," in ReConFigurable Computing and FPGAs (ReConFig), 2014 International Conference on. IEEE, 2014, pp. 1–4.
[17] A. M. Abdelhadi, G. G. Lemieux, and L. Shannon, "Modular block-RAM-based longest-prefix match ternary content-addressable memories," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 243–2437.
[18] Z. Ullah, "LH-CAM: Logic-based higher performance binary CAM architecture on FPGA," IEEE Embedded Systems Letters, vol. 9, no. 2, pp. 29–32, 2017.
[19] M. Irfan and Z. Ullah, "G-AETCAM: Gate-based area-efficient ternary content-addressable memory on FPGA," IEEE Access, vol. 5, pp. 20785–20790, 2017.
[20] H. M. Kittur et al., "Precharge-free, low-power content-addressable memory," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 8, pp. 2614–2621, 2016.
[21] S. Mishra and A. Dandapat, "Energy-efficient adaptive match-line controller for large-scale associative storage," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 64, no. 6, pp. 710–714, 2017.
[22] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 12, pp. 2969–2979, 2012.
[23] Z. Ullah, M. K. Jaiswal, R. C. Cheung, and H. K. So, "UE-TCAM: An ultra efficient SRAM-based TCAM," in TENCON 2015 - 2015 IEEE Region 10 Conference. IEEE, 2015, pp. 1–6.
[24] A. Ahmed, K. Park, and S. Baeg, "Resource-efficient SRAM-based ternary content addressable memory," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 4, pp. 1583–1587, 2017.
[25] C. A. Zerbini and J. M. Finochietto, "Performance evaluation of packet classification on FPGA-based TCAM emulation architectures," in Global Communications Conference (GLOBECOM), 2012 IEEE. IEEE, 2012, pp. 2766–2771.
[26] M. Somasundaram, "Circuits to generate a sequential index for an input number in a pre-defined list of numbers," US Patent 7,155,563, Dec. 26, 2006.
[27] Z. Ullah, M. K. Jaiswal, and R. C. Cheung, "Z-TCAM: An SRAM-based architecture for TCAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 2, pp. 402–406, 2015.
[28] ——, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits, Systems, and Signal Processing, vol. 33, no. 10, pp. 3123–3144, 2014.
[29] I. Ullah, Z. Ullah, and J.-A. Lee, "Efficient TCAM design based on multipumping-enabled multiported SRAM on FPGA," IEEE Access, vol. 6, pp. 19940–19947, 2018.
[30] I. Ullah, Z. Ullah, U. Afzaal, and J.-A. Lee, "DURE: An energy- and resource-efficient TCAM architecture for FPGAs with dynamic updates," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019.
[31] M. Irfan and A. Ahmad, "Impact of initialization on gate-based area efficient ternary content-addressable memory," in 2018 International Conference on Computing, Electronics & Communications Engineering (iCCECE). IEEE, 2018, pp. 328–332.
[32] H. Mahmood, Z. Ullah, O. Mujahid, I. Ullah, and A. Hafeez, "Beyond the limits of typical strategies: Resources efficient FPGA-based TCAM," IEEE Embedded Systems Letters, 2018.
[33] S. Pontarelli, P. Reviriego, and J. A. Maestro, "Parallel D-pipeline: A cuckoo hashing implementation for increased throughput," IEEE Transactions on Computers, vol. 65, no. 1, pp. 326–331, 2016.
[34] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Hel-Or, "Encoding short ranges in TCAM without expansion: Efficient algorithm and applications," IEEE/ACM Transactions on Networking, vol. 26, no. 2, pp. 835–850, 2018.
VOLUME 4, 2016 9
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.
TABLE 4: Comparison with other TCAM architectures. (Tech.: technology node in nm; Speed: speed of the design in MHz; Speed′: normalized speed of the design computed using (2); Slices: number of slices used by the design on the target FPGA; Slices′: number of slices computed using (1) by combining BRAMs and slices; Throughput is in Gbits/s, computed using (3).)
TCAM Design        TCAM Size   FPGA Device (Tech.)  BRAMs (36K)  Slices (#)  Slices′ (#)  Speed (MHz)  Speed′ (MHz)  Throughput (Gb/s)
Xilinx Locke [14]  256 × 32    Virtex-5 (65 nm)     0            1406        —            100          163           5.2
HP-TCAM [22]       512 × 36    Virtex-6 (40 nm)     56           1637        2981         118          —             4.25
Z-TCAM [27]        512 × 36    Virtex-6 (40 nm)     40           1116        2076         159          —             5.72
E-TCAM [28]        512 × 36    Virtex-6 (40 nm)     40           1116        2076         164          —             5.77
UE-TCAM [23]       512 × 36    Virtex-6 (40 nm)     32           913         1681         202          —             7.26
REST [24]          72 × 28     Kintex-7 (28 nm)     1            77          101          50           35            0.98
Hierarchical [16]  504 × 180   Virtex-6 (40 nm)     0            10067       —            109          —             19.26
DURE [30]          512 × 36    Virtex-6 (40 nm)     0            1668        —            335          —             12.06
Scalable [15]      1024 × 150  Virtex-7 (28 nm)     0            20526       —            199          139           20.9
Xilinx SDNet [11]  512 × 128   Virtex-7 (28 nm)     3            12011       12083        171          120           15.36
D-TCAM I           512 × 36    Virtex-6 (40 nm)     0            968         —            460          —             16.6
D-TCAM I           512 × 72    Virtex-6 (40 nm)     0            2357        —            214          —             15.4
D-TCAM II          512 × 144   Virtex-6 (40 nm)     0            4835        —            259          —             37.3
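The Throughput column can be cross-checked against the Speed and TCAM-size columns. For the D-TCAM rows, the values are consistent with Eq. (3) taking the form throughput = speed × word width, i.e., one TCAM word is matched per clock cycle; that equation form is inferred here from the table data, not quoted from the text. A minimal sketch:

```python
# Cross-check of Table 4's Throughput column for the D-TCAM rows,
# assuming Eq. (3) is: Throughput (Gb/s) = Speed (MHz) * word width (bits) / 1000.
rows = [
    ("D-TCAM I",  36,  460),  # (design, TCAM width in bits, speed in MHz)
    ("D-TCAM I",  72,  214),
    ("D-TCAM II", 144, 259),
]
for name, width_bits, speed_mhz in rows:
    # One word searched per cycle -> bits per second = f_clk * width.
    throughput_gbps = speed_mhz * width_bits / 1000
    print(f"{name}: {throughput_gbps:.1f} Gb/s")
# Prints 16.6, 15.4, and 37.3 Gb/s, matching the table.
```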
[35] F. Syed, Z. Ullah, and M. K. Jaiswal, "Fast Content Updating Algorithm for an SRAM-based TCAM on FPGA," IEEE Embedded Systems Letters, 2017.
[36] D.-Y. Lee, C.-C. Wang, and A.-Y. Wu, "Bundle-Updatable SRAM-Based TCAM Design for OpenFlow-Compliant Packet Processor," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019.

MUHAMMAD IRFAN received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Peshawar, Pakistan, in 2013, and the M.S. degree in electrical engineering from Lahore University of Management Sciences, Lahore, Pakistan, in 2016. He taught for two years in the Department of Electrical Engineering, CECOS University of IT & Emerging Sciences, Peshawar, Pakistan, and is currently on leave from there. He is pursuing the Ph.D. degree in electrical engineering at City University of Hong Kong, Hong Kong. His research interests include FPGA-based digital system design, low-power computer architectures, and memory design.

DR. RAY C. C. CHEUNG received the B.Eng. and M.Phil. degrees in computer engineering and computer science and engineering from the Chinese University of Hong Kong (CUHK), Hong Kong, in 1999 and 2001, respectively, and the Ph.D. degree and DIC in computing from Imperial College London, London, United Kingdom, in 2007. After completing the Ph.D., he received the Hong Kong Croucher Foundation Fellowship for postdoctoral study in the Electrical Engineering Department, University of California, Los Angeles (UCLA). In 2009, he was a visiting research fellow in the Department of Electrical Engineering, Princeton University, Princeton, NJ. Currently, he is an associate professor in the Department of Electronic Engineering, City University of Hong Kong (CityU). He is the author of more than 150 journal and conference papers. His research team, the CityU Architecture Lab for Arithmetic and Security (CALAS), focuses on reconfigurable trusted computing, applied cryptography, and high-performance biomedical VLSI design. He is a member of the IEEE.