
A Comprehensive Survey on SmartNICs: Architectures, Development Models, Applications, and Research Directions

Elie Kfoury, Samia Choueiri, Ali Mazloum, Ali AlSabeh, Jose Gomez, and Jorge Crichigno

arXiv:2405.09499v1 [cs.NI] 15 May 2024

Abstract—The end of Moore's Law and Dennard Scaling has slowed processor improvements in the past decade. While multi-core processors have improved performance, they are limited by the application's level of parallelism, as prescribed by Amdahl's Law. This has led to the emergence of domain-specific processors that specialize in a narrow range of functions. Smart Network Interface Cards (SmartNICs) can be seen as an evolutionary technology that combines heterogeneous domain-specific processors and general-purpose cores to offload infrastructure tasks. Despite the impressive advantages of SmartNICs and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of NICs from basic to SmartNICs, describing their architectures, development environments, and advantages over legacy NICs. The paper then presents a comprehensive taxonomy of applications offloaded to SmartNICs, covering network, security, storage, and machine learning functions. Challenges associated with SmartNIC development and deployment are discussed, along with current initiatives and open research issues.

Index Terms—SmartNIC, Data Processing Unit (DPU), Infrastructure Processing Unit (IPU), Moore's law, application offloading, P4, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA).

I. INTRODUCTION

In 1965, Gordon Moore predicted that the number of transistors per chip would double every year [1], a prediction updated in 1975 to every two years [2]. In 1974, Robert Dennard noted that power density was constant for a given silicon area even as the number of transistors increased, because of the smaller dimensions of each transistor. Transistors used less power, and the performance of integrated circuits was enhanced by packing more transistors per chip [3]. The ability of the microprocessor, or simply processor, to exploit the advances in integrated circuits enabled impressive performance improvements, see Fig. 1 [4]. Unfortunately, in 2003, the limits of power due to the end of Dennard Scaling slowed processor performance growth to 23% per year. This observation forced the industry to use multiple processors per chip, referred to as cores. While multi-core processors helped improve performance, they have slowed down in the last decade because of the natural limits prescribed by Amdahl's Law [5]: there is a maximum performance benefit from parallelism, since applications also have tasks that must be executed sequentially (with a parallelizable fraction p, the speedup on N cores is at most 1 / ((1 - p) + p/N)). Additionally, Moore's law has recently ended, causing processor improvements to slow down even further.

Fig. 1. Growth in processor performance over 40 years, relative to the VAX 11/780 as measured by the SPEC integer benchmarks. Reproduced from [4].

In today's world, most data arrive at compute locations as packets from the networks. The traditional communication channel connecting networks and hosts is the Network Interface Card (NIC). In the past, NICs were simple hardware-based devices that received the packets from the network and placed them in memory at the host. Packets would then wait for processing time by the general-purpose processor at the host [6]. Although this model was successful for a long time, it has several challenges in current environments:
• As Moore's law and Dennard scaling ended, simply adding more processing capacity to cope with the increasing amount of traffic is no longer an option.
• A large percentage of the tasks executed by the processors relate to the infrastructure rather than to the user applications, e.g., TCP/IP tasks, encryption, compression, etc. Such operations use valuable processor cycles that could be used for application tasks instead.
• Historical software solutions for packet-related tasks are not efficient in terms of throughput, latency, and energy. While in the past inefficient software solutions were mitigated by the relentless progress of (hardware) processors, today's solutions can no longer rely on future improvements in processor performance.
• The explosion of network traffic is accompanied by impressive improvements in the physical layer and bandwidth capacity. As network traffic arrives at servers at higher rates, processors are unable to process it on time, and the gap between processor performance and bandwidth is only increasing, see Fig. 2.

E. Kfoury, S. Choueiri, A. Mazloum, A. AlSabeh, J. Gomez, and J. Crichigno are with the Integrated Information Technology Department at the University of South Carolina, USA ({ekfoury, choueiri, amazloum, aalsabeh, gomezgaj}@email.sc.edu, jcrichigno@cec.sc.edu).
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Fig. 2. Network port speeds (solid line, right y-axis) versus processor speed (dotted line, left y-axis) over the years. Port speeds are increasing while processor speeds are plateauing. Reproduced from [12].

Since Dennard Scaling ended and the energy budget is no longer increasing, many consider that the only path left to improve energy, performance, and cost is to use domain-specific processors rather than power-hungry general-purpose processors. SmartNICs can be seen as a revolutionary technology developed to address the challenges listed above by combining heterogeneous domain-specific processors that specialize in a narrow range of infrastructure tasks. These include compression/decompression processors, programmable pipelines, encryption/decryption processors, and others. SmartNICs also include general-purpose processors, which are used for managing the system, aiding the domain-specific processors, and enabling users to run control-plane applications. In the context of SmartNICs, the terms accelerators and engines are also used to refer to domain-specific processors [7], [8]. Note that domain-specific processors have been successfully used in several domains, including graphics in the 2000s (Graphics Processing Units, GPUs), machine learning in the mid-2010s (Tensor Processing Units, TPUs), networking in the late 2010s (Network Processor Units, NPUs, which adhere to architecture models such as the Protocol Independent Switch Architecture, PISA), and genomics in 2018 [9]–[11].

The momentum of SmartNICs is reflected in the global Information Technology (IT) ecosystem. Hyperscalers such as Google, Amazon, and Microsoft are designing their own SmartNICs to run infrastructure functions and optimize revenue and performance [13]–[15]. Manufacturers such as Intel, NVIDIA, and AMD are emphasizing the development of SmartNICs for a broad market range, offering Systems on a Chip (SoCs) with programmable domain-specific processors for security, networks, storage, and telemetry [16]. Cloud systems such as the Monterey project are redefining cloud architectures by incorporating SmartNICs to run storage, network, and security services, resulting in substantial improvements in performance while leaving more processor cycles for user applications [7], [8]. Research and education networks (RENs) such as the Energy Sciences Network (ESnet), the high-performance network that carries traffic for the U.S. Department of Energy and research organizations, are upgrading their infrastructures with SmartNICs to enable data-intensive science [17]. Software vendors are also offloading their solutions to SmartNICs; VMware's ESXi, vCenter, and NSX, integral components for virtualizing High Performance Computing (HPC) environments, can now be effectively offloaded onto SmartNICs [18]. Palo Alto Networks, a leading Next-Generation Firewall (NGFW) vendor, introduced the "Intelligent Traffic Offload" service [19]; this service offloads firewall functions to SmartNICs. Juniper Networks' virtual router/firewall can also be offloaded to SmartNICs [20]. Telecommunication operators are increasingly migrating their core services to run on SmartNICs [21]. Serverless and edge computing workloads, including Machine Learning (ML) training and inference, can be accelerated using SmartNICs [22], [23]. Testbeds such as FABRIC [24] and GEANT [25], used worldwide for fundamental research, rely on SmartNICs and other programmable devices to allow experimenters to program the data path behavior and process network traffic in novel ways at line rate [26], [27].

Fig. 3. Paper roadmap.

A. Paper Contributions

Despite the increasing interest in SmartNICs, prior research has only partially covered this technology. As shown in Table I, there is currently no updated and comprehensive material on SmartNICs. This paper addresses this gap by providing an overview of the evolution of NICs, starting from traditional basic NICs to SmartNICs. It describes the hardware architectures, technologies, and software development environments used with SmartNICs, as well as the advantages that SmartNICs offer over legacy NICs. The paper then proposes a taxonomy of the functions and applications being offloaded to SmartNICs, illustrating their advantages over the conventional method of executing such applications. Additionally, the paper discusses the challenges associated with SmartNICs and concludes by discussing future perspectives and open research issues.
TABLE I
COMPARISON WITH RELATED SURVEYS.
(The table scores the related surveys [28]–[34] and this survey along seven dimensions: evolution and definition; architectures and models; development environments; applications and offloaded workloads taxonomy; comparisons with regular NICs; challenges and discussions; and research trends and directions. Each dimension is marked as covered, partially covered, or not covered. The prior surveys address these dimensions only partially or not at all, whereas this survey covers all of them.)

B. Paper Organization

The road map of this survey is depicted in Fig. 3. Section II compares existing surveys on SmartNICs and related technologies and demonstrates the novelty of this work. Section III presents an overview of the evolution of NICs, from traditional basic NICs to SmartNICs. It describes the components of SmartNICs and their benefits compared to legacy NICs. Section IV describes the SmartNIC hardware architectures. Section V describes the tools, frameworks, and development environments for SmartNICs, both open-source and vendor-specific. Section VI provides a taxonomy of the applications and infrastructure workloads that are offloaded to SmartNICs. The subsequent sections (Sections VII-X) describe the security, network, storage, and compute functions. Section XI lists challenges associated with SmartNICs. It then discusses current initiatives that overcome the challenges and provides a reflection on open research issues. Section XII concludes the paper. The abbreviations used in this article are summarized in Table XII, at the end of the article.

II. RELATED SURVEYS

Despite the widespread interest from both industry and academia in SmartNICs, there is a noticeable absence of a comprehensive survey that adequately explores their potential and ongoing research endeavors. The existing surveys that are closest to this paper can be divided into 1) packet processing acceleration; and 2) programmable data planes.

A. Surveys on Packet Processing Acceleration

The existing surveys in this category discuss the advantages of accelerating packet processing, particularly with software technologies. However, while SmartNICs are occasionally mentioned in these surveys, they fail to delve into crucial aspects such as their potential, architectures, and applications.

Cerović et al. [28] discuss various software-based and hardware-based packet accelerators. The survey focuses on server-class networking. It starts by explaining the problems associated with using the standard Linux kernel for packet processing in high-speed networks and then delves into exploring the different classes of packet accelerators. For the software-based packet accelerators, the survey mainly describes and analyzes the Data Plane Development Kit (DPDK) [35], PF_RING [36], NetSlices [37], and Netmap [38]. For the hardware-based packet accelerators, it focuses mainly on leveraging GPUs and Field Programmable Gate Arrays (FPGAs) for optimized and efficient packet processing. The survey does not cover the latest generation of SmartNICs that include CPU cores and domain-specific accelerators. Also, the survey does not cover the applications or the infrastructure workloads that can be offloaded to SmartNICs.

Freitas et al. [29] describe multiple packet processing acceleration techniques. The survey focuses on packet processing in Linux environments. It categorizes packet processing acceleration into hardware-, software-, and virtualization-based. For each category, the survey offers background information and discusses a simple use case. The survey also provides discussions on host resource usage efficiency, high packet rates, system security, and flexibility/expandability. The survey briefly mentions programmable NICs (another term used for SmartNICs) and their role in accelerating packet processing. It does not cover their development environments, hardware architectures, or the applications/workloads that can be offloaded.

Linguaglossa et al. [30] focus on software and hardware technologies that accelerate Network Function Virtualization (NFV). The survey categorizes software acceleration technologies into pure software acceleration and hardware-supported functions in software. It also provides a brief overview of the software acceleration ecosystem, which includes DPDK, XDP, Netmap, and PF_RING. For the hardware technologies, it discusses the offloading functions of traditional NICs (e.g., CRC calculation, checksum computation, and TCP Offload Engine (TOE)) and a subset of the hardware architectures of SmartNICs. Then, it provides a brief overview of the programming abstractions in SmartNICs. The survey has the following limitations: 1) it does not cover all the hardware architectures; 2) it does not cover the development tools and environments; and 3) it does not cover the applications and infrastructure workloads that can be offloaded to SmartNICs.

Fei et al. [31] also focus on NFV acceleration. The survey classifies NFV acceleration into three high-level categories: computation, communication, and traffic steering.
Under the computation category, the survey discusses some hardware offloading architectures, which include SmartNICs. The remainder of the survey focuses on software acceleration and how to tune the system to achieve better performance. The survey has the following limitations: 1) it does not cover the hardware architectures used by the latest generation of SmartNICs; 2) it does not cover the development tools and environments; and 3) it does not cover the applications and workloads that can be offloaded to the SmartNIC.

Shantharama et al. [32] provide a comprehensive survey on softwarized NFs. The survey classifies the CPU, the memory, and the interconnects as the three main enabling technologies for NFVs. With low-level details, the survey explains how each class operates and how it can be optimized to provide better virtualization support. It also discusses the use of dedicated hardware accelerators (FPGAs, ASICs, etc.) to improve the performance of softwarized NFs. The survey briefly describes some of the applications offloaded to SmartNICs without providing a sufficient overview of the technology, the different available development environments, or the latest enhancements in the field of SmartNICs.

Vieira et al. [33] focus only on the extended Berkeley Packet Filter (eBPF) and the eXpress Data Path (XDP) software acceleration techniques. The survey illustrates the process of enhancing packet processing speed by running eBPF-based applications in the XDP layer of the Linux kernel network stack. It presents a tutorial that includes the compilation and verification processes, the program structure, the required tools, and walk-through example programs. Although the authors mention SmartNICs as a target platform for eBPF applications, the survey does not cover the available architectures of SmartNICs, the development environments, or the applications that can be offloaded to SmartNICs.

Rosa et al. [34] describe multiple software and hardware techniques to enhance packet processing speed in the cloud. While discussing software-based techniques, the survey focuses on zero-copy data transfers, minimal context switching, and asynchronous processing as the core techniques for network acceleration. After that, it shows how DPDK, XDP, and eBPF are used in the cloud to enable Network Acceleration as a Service (NAaaS). While discussing hardware-based techniques, the survey only focuses on RDMA. The authors only describe SmartNICs as an enabling technology for RDMA and virtualization, without describing their different architectures, development environments, or their different capabilities for enhancing network acceleration.

B. Surveys on Programmable Data Planes

Numerous surveys have covered the general aspects of programmable data planes in the past few years [39]–[43]. Some surveys focused on specific areas such as network security [44]–[46], ML training and inference [47], [48], TCP enhancements [49], virtualization and cloud computing [50], 5G and telecommunications [51], and rerouting and fast recovery [52], [53]. All these surveys have discussed some applications developed on SmartNICs. However, their focus is on programmable switches (e.g., Intel's Tofino). Recent advances in SmartNICs are not covered in these surveys.

C. Novelty

Table I summarizes the topics and the features described in the related surveys. It also highlights how this paper differs from the existing surveys. To the best of the authors' knowledge, this work is the first to exhaustively explore the whole SmartNIC ecosystem. Unlike previous surveys, this survey provides in-depth discussions on the evolution and definition of SmartNICs, the common architectures used by various SmartNIC models in the market, and the development environments (both open source and proprietary). It then provides a detailed taxonomy covering the applications that are offloaded to SmartNICs, while highlighting the performance gains compared to regular NICs. The survey also presents the challenges associated with programming and deploying SmartNICs, as well as the current and future research trends.

III. EVOLUTION OF NETWORK INTERFACE CARDS (NICs)

There are three main generations of NICs: traditional NICs, offload NICs, and SmartNICs. Fig. 4 shows a simplified diagram of the three NICs.

Fig. 4. Main functional blocks of (a) traditional NICs, (b) offload NICs, and (c) SmartNICs.
A. Traditional NICs

Traditional NICs (Fig. 4 (a)) are devices that implement basic physical and data-link layer services. These services include serializing/deserializing frames, managing link access, and providing error detection. Typically, these services are executed by a fixed-function component residing on a special-purpose chip within the NIC. On the sending side, the fixed-function component accepts a datagram created by the host, encapsulates it in a link-layer frame, and then transmits the frame into the communication link, following the link-access protocol. On the receiving side, the fixed-function component receives the frame and forwards it to the host via a Peripheral Component Interconnect Express (PCIe) card.

B. Offload NICs

Offload NICs (Fig. 4 (b)) incorporate hardware in the form of ASICs and/or FPGAs to execute basic "infrastructure" functions¹ that were previously handled by the host. The goal is to free up cycles in the main host's CPU for application (end-user) tasks rather than infrastructure tasks. Examples of such functions include:
• Basic packet processing: parsing and reassembling IP datagrams, computing IP checksums, encapsulating and de-encapsulating TCP segments.
• Managing TCP connections on the NIC: connection establishment, checksum and sequence number calculations, TOE, sliding window calculations for segment acknowledgment and congestion control, among others.
• Other functions that manipulate TCP/IP header fields to implement basic filtering and traffic classification.

Offload NICs allow end users to perform pre-programmed functions on the NIC. However, they do not support the creation and execution of custom applications directly on the NIC. Even with full transport layer offload, application protocols still need to be implemented on the host processor [9].

C. SmartNICs

The definition of a SmartNIC is not widely agreed upon. Traditionally, NICs that performed functions beyond basic packet processing were labeled as SmartNICs. Unless otherwise noted, the term SmartNIC in this survey refers to the latest generation of NICs, also known as SoC SmartNICs, Infrastructure Processing Units (IPUs)², Data Processing Units (DPUs), and Auxiliary Processing Units (xPUs)³.

Fig. 4 (c) shows a simplified diagram of a SmartNIC. The SmartNIC includes a Traffic Manager (TM) or a NIC switch that performs Quality of Service (QoS) and steers traffic to the NIC execution engines. The NIC execution engines consist of a combination of processors used for custom packet processing and other domain-specific functions. Some SmartNICs (e.g., NVIDIA's BlueField-2 [54]) use a multi-core CPU processor for custom packet processing. Other SmartNICs (e.g., AMD Pensando DSC [55]) use embedded flow engines running a P4-programmable Application Specific Integrated Circuit (ASIC) pipeline. Yet other SmartNICs (e.g., AMD Xilinx SN1000 [56]) use an FPGA for custom packet processing. The domain-specific processors are optimized to provide high-performance and energy-efficient processing for a specific set of functions (e.g., cryptography). The execution engines have a memory hierarchy that typically consists of an L1 cache, scratchpad, L2 cache, and Dynamic RAM (DRAM).

¹ In this context, infrastructure functions refer to tasks that facilitate data movement to the host and do not involve application data.
² IPU is the terminology used by Intel.
³ xPU is used by the Storage Networking Industry Association (SNIA) community.

Fig. 5. In a deployment with a traditional NIC (a), the host CPU cores execute infrastructure functions and user applications. With SmartNICs (b), the host CPU cores solely execute user applications. The SmartNIC CPU cores assist other accelerators in executing infrastructure functions.

SmartNICs have general-purpose CPU cores for executing control plane functions. The CPU cores also enable SmartNICs to function autonomously and have their own Operating System (OS), such as Ubuntu Linux, which is independent of the host system in which they are running. The programmable components of a SmartNIC allow it to execute infrastructure functions without involving the CPU of the host. Consider Fig. 5. In a deployment with a traditional NIC (a), the host CPU cores execute infrastructure functions (typically classified as network, security, and storage) and user applications; with SmartNICs (b), the host CPU cores solely execute user applications. The SmartNIC CPU cores assist other domain-specific accelerators in executing infrastructure functions.

1) Custom Packet Processing: SmartNICs enable developers to devise custom packet processing on their execution engines. The packet processing logic can be implemented on CPU cores, FPGAs, or programmable ASIC pipelines. Regardless of the hardware architecture used by the SmartNIC, its packet processing engines include the following components: a programmable parser, a programmable match-action pipeline, and a programmable deparser.
These components closely resemble those of the PISA architecture [57], see Fig. 6.

Fig. 6. Programmable Pipeline.

The programmable parser allows the developer to define headers based on custom or standard protocols and parse them. It is represented as a state machine. The programmable match-action pipeline carries out operations on packet headers and intermediate results. Each match-action stage comprises multiple memory blocks (such as tables and registers) and Arithmetic Logic Units (ALUs) that enable concurrent lookups and actions. To address data dependencies and ensure coherent processing, stages are organized sequentially. After processing the packet, the programmable deparser reconstructs packet headers and serializes them for transmission.
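To make the abstraction concrete, the following C sketch mimics a single match-action stage in software. It is purely illustrative: the structures, field names, and actions are invented for this example and do not correspond to any vendor SDK, and hardware pipelines perform the lookup in dedicated memories and apply actions with parallel ALUs rather than with a loop.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative model of one match-action stage. */

struct headers {                 /* produced by the parser state machine */
    uint32_t dst_ip;
    uint8_t  ttl;
};

enum action_id { ACT_FORWARD, ACT_DROP };

struct table_entry {
    uint32_t key;                /* match field: destination IP */
    enum action_id action;       /* action selected by the control plane */
    uint16_t out_port;           /* action parameter */
};

/* Exact-match lookup over a (tiny) table; returns true and fills `hit`
 * when the key is found. A miss would trigger a default action. */
static bool stage_lookup(const struct table_entry *tbl, int n,
                         uint32_t key, struct table_entry *hit)
{
    for (int i = 0; i < n; i++) {
        if (tbl[i].key == key) { *hit = tbl[i]; return true; }
    }
    return false;
}

/* Apply the action returned by the lookup to the parsed headers. */
static void stage_apply(struct headers *h, const struct table_entry *hit,
                        uint16_t *egress_port, bool *drop)
{
    switch (hit->action) {
    case ACT_FORWARD: h->ttl--; *egress_port = hit->out_port; break;
    case ACT_DROP:    *drop = true;                           break;
    }
}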
Although various vendors have their own models for programming the pipeline, there is a common goal across the industry to make them P4-programmable [58]. P4, originally designed as a domain-specific language for programmable data plane switches, has gained popularity in programming packet data paths due to its simplicity and versatility.

2) Domain-specific Packet Processing: Infrastructure tasks can be broadly categorized into network functions, security functions, and storage functions. These tasks are integral to various networks, including data centers, cloud environments, enterprise networks, and campus networks. Given the specificity of these functions, some are optimized by being implemented directly in hardware to enhance their speed and efficiency. For instance, Transport Layer Security (TLS), a widely used protocol for encrypting application payloads and authenticating users, involves functions like encrypting and decrypting data. Recognizing the repetitive nature of these operations, it is practical to hardcode them into hardware. Hardware-based crypto processors, which have been utilized for some time, are examples of domain-specific processors incorporated into SmartNICs. In addition to improving the speed and efficiency of their respective functions, domain-specific processors free up CPU cores on the host for other computing tasks.

Other examples of domain-specific processors include regular expression (RegEx) engines used for tasks requiring Deep Packet Inspection (DPI), Non-Volatile Memory Host Controller over Fabrics (NVMe-oF) engines for remote storage, data compression, data deduplication, Remote Direct Memory Access (RDMA), etc. The application sections of this survey (Section VIII) will explore more specific use cases that leverage these domain-specific accelerators for various applications.

3) Control Plane and Management: SmartNICs incorporate CPU cores for running control plane functions and for managing the SmartNIC. The CPU cores can also be used for implementing functions that do not fit in ASIC/FPGA execution engines. The CPU cores are typically ARM- or MIPS-based. Some advantages of incorporating CPU cores within the SmartNIC are:
• Certain infrastructure functions (e.g., key distribution for TLS sessions) require execution in the CPU. The SmartNIC CPU cores can be utilized to perform these functions. This alleviates the burden on the host's CPU cores, allowing them to focus on executing user application functions.
• Infrastructure functions will run more efficiently on the CPU cores of the SmartNIC than on the CPU cores of the host, since they will be separated from other compute-intensive workloads used by user applications.
• Security is improved because the infrastructure functions will be completely isolated from the host.
• The ASIC/FPGA execution engines have limitations on the complexity of operations that can be performed on the packets. Such limitations stem from the fact that packets must be processed as quickly as possible to sustain line rate. Having CPU cores on the SmartNIC can help in executing such functions, at the cost of an increase in latency.

D. SmartNICs Benefits

SmartNICs offer a wide range of features and benefits that solve modern network challenges.
• Infrastructure offloads: Data center infrastructure tasks currently consume up to 30% of processing capacity [59]. This phenomenon is commonly known as the Data Center Tax. By offloading these tasks to the SmartNIC, the freed-up 30% of processing capacity becomes available for user applications. This optimization can significantly increase revenue opportunities for cloud providers. This is the main reason why hyperscalers are among the early adopters of this technology.
• Application acceleration: By incorporating hardware-based accelerators, SmartNICs demonstrate superior performance per watt compared to host-based applications. This results in reduced latency, enhancing overall efficiency.
• Agility and reprogrammability: The process of developing new silicon is time-consuming, expensive, and requires thorough testing. By the time this cycle is completed, rapid technological advancements may have already rendered the hardware obsolete. SmartNICs address this challenge by offering programmable components, allowing for adaptability and timely updates in response to changing technological needs.
• Security isolation: SmartNICs enhance security by isolating the execution of infrastructure functions from the server execution environments.

TABLE II
FEATURES OF TRADITIONAL NICS, OFFLOAD NICS, AND SMARTNICS.

Feature                               Traditional NIC   Offload NIC   SmartNIC
Infrastructure functions separation   Low               Medium        High
Security isolation                    Low               Low           High
General-purpose CPU                   No                No            Yes
Domain-specific processors            Low               Medium        High
Customization of data plane           No                Low           High
Flexibility to define new protocols   No                Low           High
Innovation                            Low               Medium        High
Standardized models                   Yes               Yes           No
Technology maturity                   High              High          Medium
E. Comparison of Traditional, Offload, and SmartNICs

Table II contrasts the main characteristics of traditional, offload, and SmartNICs. In the latter, the infrastructure functions are separated from the user applications; this isolation improves security by protecting the user applications on the host. The separation is possible due to the presence of CPU cores and domain-specific accelerators on the SmartNIC. Moreover, the data plane (i.e., packet processing) of the SmartNIC is customizable and is defined by the developer's code; this provides flexibility in defining and processing new protocols as well as innovating with new applications. The technology maturity and the standardized architectures for SmartNICs can still be considered low in contrast to traditional and offload NICs.

IV. SMARTNICS ARCHITECTURES

The definition of a SmartNIC in Section III-C targeted SoC SmartNICs. SoC SmartNICs comprise computing units, which include a general-purpose ARM/MIPS multicore processor, as well as a multi-level onboard memory hierarchy. There is another category of SmartNICs, referred to as discrete SmartNICs. A discrete SmartNIC does not incorporate CPU cores and thus cannot run autonomously without a host platform. Regardless of whether the SmartNIC is SoC or discrete, its packet processing logic may be implemented in ASICs or FPGAs. Various SmartNICs available in the market may employ either of these hardware architectures or, in some cases, a combination of both. The SmartNIC architecture taxonomy is shown in Fig. 7. Table III summarizes the differences between the various SmartNIC architectures, as described next.

Fig. 7. Taxonomy of SmartNIC architectures based on SoC and discrete categories.

TABLE III
COMPARISON BETWEEN VARIOUS SMARTNIC ARCHITECTURES.

Architecture   Cost     Programming Complexity   Flexibility   Speed
ASIC           Low      Low                      Low           High
FPGA           High     High                     Medium        High
ASIC + FPGA    High     Medium                   Medium        High
ASIC + CPU     Medium   Low                      High          Medium
FPGA + CPU     High     High                     High          Medium

A. Discrete SmartNICs

Hardware implementations come with tradeoffs in terms of cost, programming simplicity, and adaptability. While an ASIC offers cost-effectiveness and optimal price performance, its flexibility is limited. ASIC-based SmartNICs feature a programmable data path that is relatively straightforward to configure, yet this programmability is constrained by predefined functions within the ASIC, leading to potential limitations in supporting certain workloads. In contrast, an FPGA-based SmartNIC is exceptionally programmable. Given sufficient time, effort, and expertise, it can efficiently accommodate nearly any functionality within the confines of available gates on the FPGA. However, FPGAs are known for being challenging to program and can be costly.

Integrating both ASIC and FPGA within the SmartNIC presents a balanced solution. Common functions are efficiently executed on the ASIC, leveraging its ease of programmability compared to the FPGA. Functions that cannot be programmed on the ASIC are implemented on the FPGA, providing flexibility, albeit with increased programming complexity. This design provides high packet processing speed but is costly due to the use of FPGA technology.

Table IV shows some popular commercial discrete SmartNICs from various vendors and their specifications.

TABLE IV
COMMERCIAL DISCRETE SMARTNICS FROM VARIOUS VENDORS.

Architecture   Vendor     Model                                                       PCIe gen.   Bandwidth (Gbps)               Tech. doc.
ASIC           Mellanox   ConnectX-5 / 6LX / 6DX / 7                                  3/4/5       50 / 100 / 200 / 400           [60]–[63]
FPGA           Achronix   VectorPath S7t                                              5           400                            [64]
FPGA           AMD        Alveo U50 / U50 LV / U55C / U200 / U250 / U280              3/4         1x100 / 2x100                  [65]–[68]
FPGA           Napatech   NT200A02                                                    3           2x100                          [69]
FPGA           Silicom    N501x / N5110A / FB2CDG1 / N6010/6011 / FB4XXVG TimeSync    3/4/5       4x25 / 2x100 / 4x100 / 2x400   [70]–[74]
ASIC+FPGA      NVIDIA     Innova-2 Flex                                               4           2x100                          [75]

B. SoC SmartNICs

Integrating general-purpose CPU cores into the SmartNIC can offer several advantages: 1) it significantly reduces programming complexity, as these cores can be programmed using languages such as C; 2) the flexibility of the system is greatly enhanced, allowing for the implementation of a wide range of programs, including those with complex features like loops and multiplications, a versatility that is particularly challenging to achieve on an ASIC or FPGA; 3) the management of the SmartNIC is easier and independent of the host; and 4) it is possible to run an OS and make the SmartNIC autonomous. While the CPU cores allow additional features on the NIC, functions executed on the CPU cores might not achieve line-rate performance and could incur increased latency. Table V shows some popular commercial SoC SmartNICs from various vendors and their specifications.

TABLE V
COMMERCIAL SOC SMARTNICS FROM VARIOUS VENDORS.

Architecture   Vendor           Model                                         Core type          CPU cores   PCIe gen.   Bandwidth (Gbps)       Tech. doc.
ASIC+CPU       AMD Pensando     Giglio / DSC2-(25/100/200)                    Arm A72            16          4           2x25 / 2x100 / 2x200   [55], [76], [77]
ASIC+CPU       Asterfusion      Helium EC2004Y / EC2002P                      Arm V8             24          3/4         4x25 / 2x100           [78], [79]
ASIC+CPU       Broadcom         Stingray PS225-H16                            Arm A72            8           3           2x25                   [80]
ASIC+CPU       Intel / Google   E2000                                         Arm Neoverse N1    16          4           200                    [81]
ASIC+CPU       Marvell          LiquidIO III                                  Arm A72            36          4           5x100                  [82]
ASIC+CPU       Netronome        Agilio FX / CX                                Arm A72            4           3           2x10 / 2x40            [83], [84]
ASIC+CPU       NVIDIA           BlueField-2 / 2X                              Arm A72            8           4           200                    [54]
ASIC+CPU       NVIDIA           BlueField-3 / 3X                              Arm A78            16          5           400                    [85]
FPGA+CPU       AMD              Alveo U25N / U45N / SN1000                    Arm A53 / A72      4 / 16      3/4         2x25 / 1x100 / 2x100   [56], [86], [87]
FPGA+CPU       Intel            N6000-PL / N6001-PL / C6000X-PL / C5000X-PL   Arm A53 / Xeon D   4 / 8       3/4         2x25 / 2x100           [88]
FPGA+CPU       Napatech         NT400D1xSCC / F2070X IPU                      Arm A53 / Xeon D   4 / 8       4           2x100                  [89]

C. On-path and Off-path SmartNICs

Another way to categorize the architectures of SmartNICs is based on how their NIC cores interact with network traffic. There are two categories: on-path and off-path [90].

1) On-path SmartNICs: With on-path SmartNICs (Fig. 8 (a)), the NIC cores actively manipulate each incoming and outgoing packet along the communication path. These SmartNICs provide low-level programmable interfaces, allowing for direct manipulation of raw packets. In this design, the offloaded code is closely situated to the network packets, increasing efficiency. However, the drawback is that the offloaded code competes for NIC cores with requests sent to the host. If too much computation is offloaded onto the SmartNIC, it can result in a significant degradation of regular networking requests sent to the host. Additionally, programming on-path NICs can be challenging due to the utilization of low-level APIs.

2) Off-path SmartNICs: Off-path SmartNICs (Fig. 8 (b)) take a different approach by incorporating additional compute cores and memory in a separate SoC located next to the NIC cores. The offloaded code is strategically placed off the critical path of the network processing pipeline. The SoC is treated as a second full-fledged host with an exclusive network interface, connected to the NIC cores and the host through an embedded switch (sometimes referred to as an eSwitch). Based on forwarding rules installed on the embedded switch, the traffic is delivered to the host or to the SmartNIC cores. In contrast to on-path SmartNICs, the offloaded code in off-path SmartNICs does not impact the host's network performance. This clear separation enables the SoC to run a complete kernel (e.g., Linux) with a comprehensive network stack (RDMA), simplifying system development and allowing for the offloading of complex tasks.

Fig. 8. On-path and off-path SmartNICs.

Table VI summarizes the differences between the on-path and off-path SmartNICs.

TABLE VI
CHARACTERISTICS OF ON-PATH AND OFF-PATH SMARTNICS.

Characteristics           On-path   Off-path
NIC switch                ×         ✓
Operating system          ×         ✓
Full network stack        ×         ✓
Programming complexity    High      Low
Host performance impact   High      Low
Complex code offloading   Low       High

V. SMARTNICS DEVELOPMENT TOOLS AND FRAMEWORKS

This section provides an overview of the development tools and frameworks employed for programming SmartNICs. The taxonomy, illustrated in Fig. 9, categorizes them based on the specific component within the SmartNIC being programmed.

A. Programmable Pipeline

The packet processing logic is commonly built using ASICs or FPGAs. The development of offloaded applications depends on the hardware architecture and the vendor's Software Development Kits (SDKs).
Fig. 9. Taxonomy of SmartNIC development tools and frameworks, categorized by component-specific technologies and software development environments.

1) P4 Language: In 2016, the PISA architecture was introduced as a domain-specific processor for networking [57]. PISA is programmed using the Programming Protocol-independent Packet Processor (P4) language [91]. Although P4 was initially intended to program the data plane of PISA-based switches, it has demonstrated its versatility to program data planes for other packet processing devices. Despite the variety of programming models used by various vendors, there is a common goal, which is to make their pipelines programmable in P4 [58].

P4 has a reduced instruction set and has the following goals:
• Reconfigurability: P4 enables the reconfiguration of the parser and the processing logic in the field.
• Protocol independence: P4 ensures that the device remains protocol-agnostic, allowing the programmer to define protocols, parsers, and operations for processing headers.
• Target independence: P4 hides the underlying hardware from the programmer, with the compiler considering the device's capabilities when transforming a target-independent P4 program into a target-dependent binary.

The initial specification of the P4 language, denoted as P4_14, was released in 2014 [92]. Subsequently, in 2016, a more refined version known as P4_16 was drafted [93]. P4_16 represents a matured language that extends the capabilities of P4 to a broader range of underlying targets, including ASICs, FPGAs, SmartNICs, and more.

Fig. 10 shows the workflow of developing a P4 program and deploying it into a target device. The P4 code is written by the user. The code must include a P4 architecture model, which is typically supplied by the device's manufacturer. The code is then compiled by a P4 compiler, which generates two artifacts: 1) the binary that will be deployed in the data plane of the target device; and 2) data plane APIs that allow the control plane to interact with the data plane (e.g., for generating table entries, manipulating stateful memories, etc.).

Fig. 10. P4 workflow.

2) P4 Architecture: A P4 architecture is a programming model that defines the capabilities of a target's P4 processing pipeline. P4 programs are specifically designed for a particular P4 architecture, and these programs can be applied to any targets that adhere to the same P4 architecture.

Although the P4 architecture is provided by the manufacturer of the device, it often follows the specifications of open-source architectures. With the emergence of SmartNICs, the community has developed an open-source architecture tailored for programming these NICs. This architecture is the Portable NIC Architecture (PNA) [94].

a) Portable NIC Architecture (PNA): PNA [94] is a P4 architecture that defines the structure and common capabilities for SmartNICs. PNA has four P4 programmable blocks (main parser, pre-control, main control, and main deparser), and several fixed-function blocks, as shown in Fig. 11. The host-to-net and net-to-host externs allow executing functions on the domain-specific accelerators, such as encrypting or decrypting IPsec payloads. The message processing block is responsible for converting between large messages in host memory and network-size packets on the network, and for dealing with one or more host operating systems, drivers, and/or message descriptor formats in host memory.

Fig. 11. Portable NIC Architecture (PNA).
PNA has features that are not traditionally supported by other similar P4 architectures, including:
1) Table entries modification: Other P4 architectures only allow modifying table entries from the control plane. PNA allows modifying the entries in a table directly from the data plane.
2) Table accessibility: Traditional P4 architectures allow only one operation on a table per stage. With PNA, tables can be accessible by multiple stages, even in different pipelines.
3) Non-packet processing: PNA facilitates message processing, enabling operations on larger blocks of data to be transferred to and from host memory.
4) Accelerator invocation: PNA is the only P4 architecture that supports invoking accelerators (e.g., a crypto accelerator).

Table VII compares and contrasts PNA and the Portable Switch Architecture (PSA), an architecture mainly used by switches.

TABLE VII
COMPARISON BETWEEN P4 ARCHITECTURES: PSA AND PNA.

Feature                                                  PSA         PNA
Main target devices                                      Switches    SmartNICs
Table entries modification                               ×           ✓
Table accessibility                                      One stage   Multiple stages
Non-packet processing (message processing)               ×           ✓
Accelerator invocation                                   ×           ✓
Directionality (host-to-net, net-to-host)                ×           ✓
TCP connection tracking                                  ×           ✓
Stateful elements (counters, registers, meters, etc.)    ✓           ✓

3) P4 Compiler: After writing a P4 program, the programmer invokes the compiler to generate a binary that will be deployed on the target device (e.g., the programmable pipeline of the SmartNIC). Consider Fig. 12. The P4 compiler (p4c) has a frontend and a backend. The frontend is universal across all targets and handles the parsing, syntactic analysis, and target-independent semantic analysis of the program. The frontend generates an Intermediate Representation (IR), which is then compiled by the backend compiler for a specific target. The backend is provided by the manufacturer of the device.

Fig. 12. P4 compilation process.

4) FPGA Programming: FPGAs consist of an array of configurable logic blocks and programmable interconnects, allowing users to define the functionality of the chip based on their application requirements. FPGA-based SmartNICs follow the same programming workflows as other FPGAs provided by the vendors. This means that the development tools, methodologies, and languages used for programming traditional FPGAs can be applied to SmartNICs as well. FPGA vendors provide software tools that facilitate the programming process. These tools include Integrated Development Environments (IDEs) and compilers that translate Hardware Description Languages (HDLs) such as VHDL and Verilog into configuration files for the FPGA.

5) P4-FPGA: Programming FPGAs with languages such as VHDL or Verilog can be challenging and time-consuming, especially for newcomers. To address this issue, frameworks have been developed to translate P4 code into an FPGA bitstream. P4, being a high-level and user-friendly language ideal for programming datapaths, offers a faster and more efficient alternative for FPGA programming. This approach streamlines the programming process, making it particularly accessible for users without extensive FPGA programming expertise, ultimately enhancing both accessibility and efficiency. However, there are challenges in designing a compiler that translates P4 code to VHDL or Verilog. First, FPGAs are typically programmed using low-level libraries that are not portable across devices. Second, generating an efficient implementation from a source P4 program is difficult, since programs vary widely and architectures make different tradeoffs.

The community has been actively working on developing P4-FPGA compilers. Vendors (e.g., Xilinx [95], Intel [96]) provide workflows to generate bitstreams from P4 on their targets. P4-FPGA tools can significantly reduce the engineering effort required to develop packet-processing systems based on such devices while maintaining high performance per Lookup Table (LUT) or Random Access Memory (RAM).

B. CPU Cores

User applications run on CPU cores, whether on the cores on the SmartNIC or the cores in the host. The steps for an application to process a packet coming from the NIC are shown in Fig. 13 (a). When a packet is received, the NIC triggers an interrupt that informs the OS about the packet's location in memory. The OS subsequently transfers the packet to the network stack, which then initiates system calls from the OS kernel to deliver the packet to its intended user-level application. These steps induce overheads that dramatically degrade the throughput. Today's NICs have already reached more than 200 Gbps [18]. As NICs become faster, the available time for processing individual packets becomes increasingly limited. For instance, at 200 Gbps, the time between consecutive 1500-byte packets is as low as 60 nanoseconds (ns) (a 1500-byte packet carries 12,000 bits, and 12,000 bits / 200 Gbps = 60 ns). The standard network stack is inadequate to keep up with such high traffic rates.

Fig. 13. Software packet processing: (a) standard packet processing (interrupt-based); (b) kernel-bypass packet processing (polling mode).

1) Data Plane Development Kit (DPDK): DPDK comprises a collection of libraries and drivers designed to enhance packet processing efficiency by bypassing the kernel space and handling packets within user space (see Fig. 13 (b)). With DPDK, the ports of the NIC are disassociated from the kernel driver and associated with a DPDK-compatible driver. In contrast to the conventional method of packet processing within the kernel stack using interrupts, the DPDK driver operates as a Poll Mode Driver (PMD). It consistently polls for incoming packets. The utilization of a PMD, combined with the kernel bypass, yields superior packet processing performance. DPDK's APIs can be used in C programs. DPDK started as a project by Intel and then became open source. Its community has been growing, and DPDK now supports all major CPU and NIC architectures from various vendors. A list of supported NICs can be found at [97].
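As an illustration of the polling model, the following is a minimal DPDK receive loop in C. It is a sketch only: port and queue configuration are omitted, port 0 and a burst size of 32 are arbitrary choices, and a real application would process the packets instead of immediately freeing them.

#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Minimal polling loop: repeatedly ask the PMD for a burst of packets
 * received on queue 0 of the given port, touch them, and free them. */
static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC; returns immediately with 0..BURST_SIZE packets. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Packet data is already in user space; process it here. */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}

int main(int argc, char **argv)
{
    /* Initialize the Environment Abstraction Layer (hugepages, PMDs, ...). */
    if (rte_eal_init(argc, argv) < 0)
        return -1;
    /* Port and RX queue setup (rte_eth_dev_configure, rte_eth_rx_queue_setup,
     * rte_eth_dev_start) is omitted for brevity. */
    rx_loop(0);
    return 0;
}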
2) eXpress Data Path (XDP) and extended Berkeley Packet Filter (eBPF): When utilizing DPDK, the kernel is bypassed to achieve enhanced performance. However, this comes at the cost of losing access to the networking functionalities provided by the kernel. User space applications are then required to re-implement these functionalities. XDP presents a solution to this issue. XDP operates as an eBPF program within the kernel's network code. It introduces an early hook in the RX (receive) path of the kernel, specifically within the NIC driver after interrupt processing. This early hook allows the execution of a user-supplied eBPF program, enabling decisions to be made before the Linux networking stack code is executed. Decisions include dropping packets, passing packets to the normal network stack, and redirecting packets to other ports on the NIC. XDP reduces the kernel overhead and avoids process context switches, network layer processing, interrupts, etc.

Fig. 14. XDP packet processing: (a) native XDP, slower; (b) offloaded XDP, faster.

XDP programs have callbacks that are invoked when a packet is received on the NIC. There are three models for deploying an XDP program:
• Generic XDP: XDP programs are incorporated into the kernel within the regular network path. While this method does not deliver optimal performance advantages, it serves as a convenient means to experiment with XDP programs or deploy them on standard hardware that lacks dedicated support for XDP.
• Native XDP: The NIC driver loads the XDP program in its initial receive path, see Fig. 14 (a). Support from the NIC driver is required for this mode.
• Offloaded XDP: The XDP program is loaded directly onto the NIC hardware, bypassing the CPU as a whole, see Fig. 14 (b). This requires support from the NIC hardware.
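As an example of what such a program looks like, the following restricted-C XDP program drops UDP packets and passes everything else. The program name and its policy are illustrative only; the program can be attached in generic or native mode, and offloading it requires NIC support as noted above.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Runs in the NIC driver's receive path, before the kernel stack. */
SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                 /* malformed; let the stack decide */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;                 /* only inspect IPv4 */

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_UDP)
        return XDP_DROP;                 /* drop before any sk_buff is built */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";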
3) P4 Backends: Creating P4 programs is generally considered more straightforward than writing DPDK or BPF/XDP code. Consequently, there have been efforts to translate P4 into these codes. The P4 compiler (p4c) is equipped with backends specifically designed for generating DPDK, BPF/XDP, and userspace BPF (uBPF) code. Table VIII compares the P4 backends.

TABLE VIII
COMPARISON BETWEEN THE P4 BACKENDS.

Feature              P4-DPDK              P4-eBPF/XDP                      P4-uBPF
Userspace            ✓                    ×                                ✓
NIC support          ✓                    ✓                                ✓
P4 architectures     PNA, PSA             ebpf_model.p4 / xdp_model.p4     ubpf_model.p4
Compilation          P4 → .spec → C → .so P4 → C → eBPF bytecode           P4 → C → uBPF bytecode
Supported features   High                 Low                              Medium

a) P4-DPDK: The p4c-dpdk backend translates P4_16 programs into the DPDK Application Programming Interface (API), allowing the configuration of the DPDK Software Switch (SWX) pipeline [98]. The P4 programs can be written for the PNA or PSA architectures. The backend transforms a given P4 program into a representation (.spec) that aligns with the DPDK SWX pipeline (see Fig. 15). The subsequent step involves the generation of C code from the .spec file. This code includes C functions corresponding to each action and control block. A C compiler then generates a shared object (.so) from the C code. It is important to note that P4-DPDK is not a P4 simulator (e.g., BMv2); it achieves high performance.

Fig. 15. P4-DPDK pipeline.

b) P4-eBPF: The expressive powers of the P4 and eBPF programming languages differ, yet there is a significant overlap, particularly in network packet processing. The P4-to-eBPF compiler translates P4 programs into a restricted subset of C code that is compatible with eBPF. The P4 program only defines the data plane. The control plane is separately implemented; BPF Compiler Collection (BCC) tools simplify this by generating C and Python APIs for the interaction between the data plane and the control plane. The P4-to-eBPF compiler also facilitates the integration of custom C extern functions, enabling developers to extend the P4 program's functionality by incorporating eBPF-compatible C functions. This capability empowers the P4 program with features not natively supported by the P4 language. Upon compilation, the P4 compiler generates a C file and its corresponding header. A subsequent C compiler then generates an eBPF program, loadable into the kernel using Traffic Control (TC). Once loaded, tables in the eBPF program can be manipulated with the bpftool utility provided by the kernel.
resulting in the lowest server utilization. To address these
c) P4-uBPF: uBPF adapts the eBPF processor to run in
issues, many SmartNICs offer support for offloading OvS
userspace. The utilization of uBPF is advantageous due to its
into their NIC switch. When this feature is utilized, the
compatibility with any solution implementing kernel bypass,
OvS datapath is moved to the hardware, resulting in superior
such as DPDK apps.
performance compared to the software-based versions.
The p4c-ubpf compiler translates P4 programs into uBPF a) OvS-DPDK and rte flow: OvS-DPDK enhances OvS
programs. The backend for uBPF predominantly relies on by incorporating a DPDK-based datapath in the userspace,
the P4-eBPF compiler, but generates C code compatible with surpassing the performance of the standard kernel OvS dat-
user space BPF implementation. The uBPF backend offers a apath with reduced latency. OvS-DPDK leverages hardware
broader scope compared to the eBPF backend. Beyond simple offload capabilities through rte flow [99], a DPDK-based API,
packet filtering, the P4-uBPF compiler supports P4 registers see Fig. 17 (a). This API facilitates the installation of rules
and programmable actions, encompassing packet modifica- into the hardware switch within the SmartNIC (NIC switch).
tions and tunneling. The generated C programs are compiled The rte flow API enables users to define rules for matching
by the clang compiler, which generates uBPF bytecode. The
bytecode is then loaded into the uBPF VM.
VM or VM or User space VM or
C. NIC Switch container container rte_flow container
VF Driver VF Driver VF Driver
API
The NIC switch performs QoS traffic control and steers OVS-Bridge Kernel
Kernel
traffic to the NIC execution engines. SmartNICs typically TC_Flower API
implement the NIC switch following the specifications of the VF VF VF
SmartNIC SmartNIC
open-source Open vSwitch (OvS). OvS is a software switch NIC switch NIC switch
originally designed to enable communication among Virtual Port Port

Machines (VMs). OvS has two major components, the control (a) OvS DPDK with HW offload (b) OvS kernel with HW offload (TC Flower)

4 Preliminary experiments show that more features are implemented for PNA Fig. 17. OvS hardware offload. (a) OvS DPDK with hardware offload using
over PSA for the P4-DPDK target. rte flow; (b) OvS kernel with hardware offload using tc flower.
13

OCTEON SDK
gRPC
Host client Extension packages

DPDK Network VPP PCIE Offload IPSec Secure Key Storage


pf0 vf0
Virtualization layer
BlueField pf0hpf OVS-DPDK KVM Docker/CNI VPP vSwitch
OVS-
CPU Cores BR3 Base SDK
gRPC Boot Loader Linux Kernel DPDK IPSec Toolchain
OVS- server SF2
BR1
SF0
URL filter
OVS- app Regex Fig. 19. OCTEON SDK layers and modules.
BR2 SF1 accelerator

port
processing by defining matching criteria and actions. These
match-action units are defined in pipes, which can be chained.
Data traffic path Runtime configuration Given DOCA’s reliance on DPDK, it leverages rte flow to
transmit rules to the embedded switch (NIC switch). NVIDIA
Fig. 18. DOCA URL filter reference application. employs its proprietary ASAP2 technology [102] for imple-
menting the embedded switch and for efficient traffic offload-
ing to the hardware.
specific traffic, altering the packets, querying related counters, Consider Fig. 18 which shows an example of a DOCA
etc. Matching within this context can be based on various application for Uniform Resource Locator (URL) filtering. The
criteria such as packet data (including protocol headers and developer must create OvS bridges and connect scalable func-
payload) and properties like associated physical port or virtual tions (SF)5 to them. Note that the OvS bridge is hardware of-
device function ID. Operations supported by the rte flow API floaded. In this specific example, one bridge is used to connect
include dropping traffic, diverting traffic to specific queues, the physical port to the application (OvS-BR2). Another bridge
directing traffic to virtual or physical device functions or ports, is used to connect the application to the host (OvS-BR1). The
performing tunnel offloads, and applying marks, among others. incoming packets on the physical port will be forwarded to
b) OvS-Kernel and TC Flower: The OvS-kernel can use the application, which runs on the CPU cores. URL filtering
the TC Flower [100] to configure rules on the hardware switch involves parsing the application layer because the URL to be
integrated into the SmartNIC, see Fig. 17 (b). Within the Linux visited is located at the HTTP header. The SmartNIC will
kernel, the TC flower classifier, which is a component of the invoke the regular expression (RegEx) hardware accelerator to
TC subsystem, offers a means to specify packet matches using scan for the URL, which is significantly faster than scanning
a defined flow key. This flow key encompasses fields extracted using the CPU. A third bridge can be created to enable the
from packet fields and, if desired, tunnel metadata. TC actions user to manage the application (e.g., specifying the URLs to be
enable the execution of diverse operations on packets, such as blocked). BlueField provides gRPC interfaces for the runtime
drop, modify, output, and various other functionalities. configuration.
It is possible to develop DOCA applications without the
D. Vendor-specific SDKs - ASIC hardware; however, testing the compiled software must be
done on top of a BlueField [103].
The following SDKs are proprietary and target ASIC-based 2) OCTEON SDK: The OCTEON SDK is a comprehensive
SmartNICs. suite that integrates a development environment and optimized
1) NVIDIA’s DOCA: The Data Center-on-a-Chip Architec- software modules for building applications on OCTEON fam-
ture (DOCA), is a software development framework developed ily processors. The suite consists of a base SDK, a vir-
by NVIDIA for the BlueField SmartNICs [54]. This frame- tualization layer, and a collection of SDK extension pack-
work encompasses various components, including libraries, ages designed for specific application functions. The Base
service agents, and reference applications. Applications devel- SDK relies on a standard Linux environment and user-space
oped using DOCA are written in the C programming language DPDK (see Fig. 19). It facilitates the seamless compilation
and incorporate support for DPDK. This integration ensures of DPDK, Linux, or control plane applications on top of it
that developers have access to all DPDK APIs for efficient with minimal adjustments. Programmers write C code and
packet processing. Additionally, DOCA comes equipped with invoke libraries for accelerating functions, including compres-
its own set of libraries designed to streamline interactions with sion/decompression, regex matching, encryption/decryption,
the components on the SmartNIC. For instance, to implement and more.
IPsec or perform encryption and decryption, DOCA offers In addition to the Base SDK, the suite includes SDK
dedicated APIs that developers can easily invoke, simplifying extensions that help users enable complex applications. These
the integration of these functionalities into their applications.
One noteworthy library within DOCA is the DOCA Flow 5 An SF is a lightweight function that has dedicated queues for sending and
[101]. This library allows programmers to customize packet receiving packets; it is analogous to the virtual function (VF) used in SR-IOV.
14

TABLE IX
AMD Pensando SSDK
C OMPARISON BETWEEN THE VENDOR - SPECIFIC SDK S FOR ASIC
Development Libraries and sample code S MART NIC S .
toolchain Platform CPU sample P4 sample
NVIDIA Octeon Pensando Intel/Barefoot
library codes code Characteristic
Build environment DOCA SDK SSDK SDE
with P4 compiler Supported BlueField Marvel Pensando Intel IPU
Drivers for CPU cores
SmartNICs 2/3/X LiquidIO DSC-200 E2000
DSC simulator and
Linux kernel driver DPDK driver P4 support ×* ×* ✓ ✓
test environment
Development
✓ ✓ ✓ ×
wo/ hardware
Simulator/
× ✓ ✓ ×
emulator
Fig. 20. AMD Pensando SSDK.
Special licensing × × ✓ ✓
*While P4 is not the main language used for programming the packet
processing engine, it can be used for programming the CPU cores (e.g., with
extensions consist of pre-optimized, application-specific mod- P4 DPDK).
ules bundled into packages that run on the Base SDK. No-
table extensions include OvS-DPDK, Vector Packet Processor
(VPP), secure key storage, trusted execution environment, etc. Insight [104]) that offers comprehensive insights into resource
Furthermore, the OCTEON SDK provides a cycle-accurate utilization. This includes details such as the location of specific
simulator. This simulator enables the developers to test the match-action tables, the utilization of hash bits, and the usage
behavior of their programs with precision and accuracy in of SRAM/TCAM. The public documentation does not provide
software. clear specifics on how the compiler differs between Tofino
3) AMD Pensando SSDK: The AMD Pensando SDK facili- switches and SmartNICs.
tates software development for the AMD Pensando SmartNIC. 5) SDKs for ASIC SmartNICs Comparison: Table IX com-
This comprehensive SDK includes a P416 compiler, debugging pares the four SDKs. The characteristics compared include the
tools, a DPDK driver, example codes, and thorough documen- supported SmartNIC models, P4 language support, develop-
tation (see Fig. 20). Specifically, P416 can be used to write ment feasibility with or without dedicated hardware, availabil-
code for execution in the programmable pipeline. C and C++ ity of simulators or emulators for testing, and the necessity for
are used to write code for the CPU core complex. Additionally, special licensing. The AMD Pensando and Intel SmartNICs are
the SDK allows invoking the SmartNIC’s built-in domain- P4 programmable and thus, their SDKs provide a P4 compiler.
specific accelerators. The NVIDIA BlueField and the Octeon SDK only support P4
Similar to DOCA, developers have the flexibility to com- for their CPU cores (e.g., through P4-DPDK). Furthermore, all
pile applications without the SmartNIC hardware. However, SDKs except Intel/Barefoot SDE offer development without
unlike DOCA, Pensando SDK provides a simulator, allowing dedicated hardware, and Pensando SSDK and Octeon SDK
developers to test their ideas before uploading the image provide simulators or emulators for testing purposes. The
to the hardware. This validation capability becomes partic- Pensando SSDK and the Intel SDE require the customer to
ularly advantageous when integrating the SDK and simulator sign a Non-disclose Agreement (NDA) to get the license for
into CI/CD-based development and workflows. The simula- the SDKs.
tor boasts machine-register accuracy, ensuring that any code
developed for it can be cross-compiled to run seamlessly E. Vendor-specific SDKs - FPGAs
on the real hardware. The simulator serves as a valuable The following SDKs are proprietary and target FPGA-based
tool for validation, speeding up development, and simplifying SmartNICs.
debugging processes within a virtualized environment. 1) Vitis Networking P4: Vitis Networking P4 [105], de-
The reference applications included with the AMD Pen- veloped by AMD Xilinx, is the development environment for
sando SDK include a basic skeleton hello world, Software their FPGA SmartNICs. This high-level design environment
Defined Network (SDN) policy offload with Longest Prefix greatly simplifies the creation of packet-processing data planes
Matching (LPM), Access Control List (ACL), flow aging, through P4 programs (see Section V-A5). The tool’s primary
IPsec gateway, and other classic host offload such as TCP Seg- function is to translate the P4 design intent into a compre-
mentation Offload (TSO), checksum calculation, and Receive hensive AMD FPGA design solution. The compiler maps the
Side Scaling (RSS). control flow with a custom data plane architecture composed
4) Barefoot SDE / Intel P4 Studio: The compiler used for of various engines. This process involves selecting suitable
programming the pipeline on the Intel IPU has similarities with engine types and tailoring each one according to the specified
that used for programming the Tofino switches [81]. The com- P4 processing requirements. The architecture definition file for
piler was originally developed by Barefoot Networks, which Vitis Networking P4 is named xsa.p4. This architecture follows
was acquired by Intel in 2021. This compiler was formerly the open-source P4 PNA architecture (see Section V-A2a).
known as Barefoot SDE, and now it has been rebranded as Fig. 21 illustrates the AMD Vivado™ hardware tool flows
Intel P4 Studio. It is well-established and has undergone ex- designed for AMD Vitis™ Networking P4 implementations.
tensive revisions and optimizations. Additionally, the compiler There is a flow for the software, which is used for testing the
is equipped with a Graphical User Interface (GUI) tool (P4 behavior of the P4 program. The other flow is for the hardware.
15

.p4 User’s P4
program
.p4
Vitis Networking P4 IP Intel P4 compiler
P4C-vitisnet P4C-vitisnet for FPGA
compiler compiler
.p4 Software (CPU)
Data plane Control plane
.json .sv Custom arch. RTL APIs Control plane
include file application
Intel P4 FPGA
Launch Run synthesis/ .v .json Software Framework
simulation (RTL) implementation Custom
arch. RTL
.bit
.v
.meta
Behavioral model
.meta FPGA dev. (e.g., FPGA FPGA
RTL simulator HW test Intel Quartus) Bin
p4bm-vitisnet-cli hardware
.pcap .pcap

Software flow Hardware flow .v

Custom
arch. RTL
Fig. 21. AMD Xilinx’s Vitis Networking P4 software and hardware flows.
Fig. 22. Intel P4 Suite for FPGA workflow.

In the software flow, the P4C-vitisnet compiler accepts a P4


file as input and generates an output .json file. This json includes place and route functions, timing analysis and bit-
file is provided to the P4 Behavioral Model. Note that this stream generation and download, synthesis analysis, and in-
behavioral model provided by Xilinx closely resembles the system debugging using snapshots. The Achronix design flow
behavioral model created by the community, known as BMv2. facilitates ease for FPGA designers by supporting standard
The software flow also has a Command Line Interface (CLI) RTL (VHDL and Verilog) input and employing industry-
which is a control plane application used to interact with the standard simulation techniques.
data plane at runtime. Packets and metadata information can 4) Napatech Link Toolkit: The Link toolkit, developed by
be fed into the behavioral model when testing. Napatech, is a collection of software providing plug-and-play
For the hardware flow, the P4C-vitisnet compiler generates features for various SmartNICs from Napatech. The software
an .sv file. The compiler uses the Vitis Networking P4 IP. includes: Link-capture, which facilitates packet capture with
The .sv file can be used for launching Register Transfer nanosecond timestamping and replay with precise inter-frame
Level (RTL) simulation on the RTL simulator or for running gap control. Link-Inline accelerates various network and secu-
synthesis/implementation on the hardware. The hardware flow rity applications, Link-Virtualization offloads a virtual switch
also accepts metadata and packets as inputs. to the SmartNIC, and Link-Storage accelerates virtualized
Xilinx also provides the Xilinx Runtime library (XRT) storage.
[106], which is a user-friendly, open-source software stack 5) OFS and OPAE: The Open FPGA Stack (OFS) is an
designed to facilitate communication between the application open-source solution that offers a hardware and software
code and the FPGA device. It offers APIs for Python and framework for creating shell design and workload [107].
C/C++ applications. OFS includes reference shell designs for various Intel FPGA
2) Intel P4 Suite for FPGA: The Intel P4 Suite for FPGA is devices, drivers, and software tools. Using OFS, Application-
a high-level design toolkit that produces packet processing IP
for Intel-based FPGAs from P4 codes. This toolkit comprises
a compiler responsible for converting P4 into RTL and a Developer, network
software framework equipped with APIs that facilitate the administrator
interaction between control plane applications and the data OPI REST, gRPC
plane. The toolkit’s workflow is shown in Fig. 22. Intel’s P4 Vendor API gateway load Vendor
compiler for FPGA accepts as input the P4 program and a balancer gateway

custom architecture, and generates RTL for the data plane


and APIs for the control plane. The resulting RTL, combined Network shim Storage shim Security shim AI/ML shim
with the architecture’s RTL and the shell RTL are then API API API API

transformed into an FPGA binary using conventional FPGA


development environments (e.g., Intel Quartus). Finally, this Software-development Kits (SDKs)
[NVIDIA, Intel, Xilinx …]
binary is deployed to the hardware. The control plane APIs are
pushed to the Intel P4 FPGA for Software Framework which SmartNIC targets
enables user-defined control plane applications to interact with
the hardware.
Fig. 23. Open Programmable Infrastructure (OPI) architecture. The vendor-
3) Achronix Tool Suite: The Achronix Tool Suite (ACE) specific SDKs and hardware are abstracted via a common API developed by
is used to design Achronix’s FPGA SmartNICs. The tool OPI.
16

SmartNIC Offloading Applications

Security Network Storage Compute

Csadompu
(Section VII) (Section VIII) (Section IX) (Section X)

Firewalls and packet filters Switching / routing Storage initiator Machine learning
Open vSwitch NVMe-oF initiator offload ML training
Intrusion detect/prevention
ML inference
Tunneling and overlay Target initiator
Flow bypass
Key-value stores
Deep Packet Inspection (DPI) VxLAN, GRE, Geneve NVMe-oF target offload
Custom IDP/IPS functions Data replication
Observability and telemetry Compression
Ordering
Data in transit encryption
Network observability Deflate, zlib, SZ3
IPSec offload System observability Transaction processing
TLS offload VM/containers observability Scheduling
Data at rest encryption Aggregation
Load balancing
Serverless computing
L4-L7 load balancers
Receive Side Scaling (RSS) Lambda on NIC
Heterogeneous devices
5G User Plane Function

Fig. 24. Taxonomy of SmartNIC offloaded applications.

Specific FPGA Interface Managers (FIMs) can be developed, leverages established tools like Storage Performance Devel-
making use of the Open Programmable Acceleration Engine opment Kit (SPDK), DPDK, and P4 to facilitate network
(OPAE) SDK. OPAE, which is a subset of OFS, is a software and storage virtualization, workload provisioning, root-of-trust
layer comprising various API libraries. These libraries are used establishment, and various offload capabilities inherent to the
when programming host applications that interact with the platform. IPDK is a sub-project of OPI.
FPGA accelerator. IPDK already supports multiple targets including P4 DPDK,
OCTEON SmartNICs, Intel IPU, Intel FPGA, and Tofino-
based programmable switch [109].
F. Vendor-agnostic IPDK has two main interfaces: 1) Infrastructure Application
1) Open Programmable Infrastructure (OPI): The OPI is a Interface; and 2) Target Abstraction Interface. The Infrastruc-
community-driven initiative focused on creating common APIs ture Application Interface serves as the northbound interface
to configure and manage different SmartNIC targets [108]. of the SmartNIC, encapsulating the diverse range of Remote
Instead of relying on vendor-specific SDKs, developers can Procedure Calls (RPCs) supported within IPDK. The Target
use OPI’s standardized APIs to activate services, effectively Abstraction Interface represents an abstraction provided by an
abstracting the complexities associated with vendor-specific infrastructure device (e.g., SmartNIC) that runs infrastructure
SDKs. Consider Fig. 23. The developer uses gRPC and REST applications for connected compute instances. These instances
APIs to initial calls to the API gateway. The gateway acts as could include attached hosts and/or VMs, which may or may
a load balancer between four shim APIs: network, storage, not be containerized.
security, and AI/ML. These shim APIs then translate the calls 3) SONIC-DASH: SONiC, an open-source operating sys-
to the hardware accelerators through the vendor-specific SDKs. tem for network devices, has experienced significant growth
With such a design, portability can be ensured across various [110], [111]. The SONiC community has introduced a new
targets. Note that the developers can still execute functions open-source project called DASH (Disaggregated APIs for
provided by the vendor if they are not available through the SONiC Hosts) aiming at being an abstraction framework for
OPI APIs. SmartNICs and other network devices. It consists of a set of
2) Infrastructure Programmer Development Kit (IPDK): APIs and object models which cover network services for
The IPDK is an open-source, vendor-agnostic framework the cloud. The initial objective of DASH is to enhance the
comprising drivers and APIs tailored for infrastructure offload performance and connection scale of SDN operations, aiming
and management tasks. It is versatile and capable of running to achieve a speed increase of 10 to 100 times compared
on a range of hardware platforms including SmartNICs, CPUs, to software-based solutions in today’s clouds and enterprise.
or switches. Operating within the Linux environment, IPDK DASH’s ecosystem includes a community of cloud providers,
17

Perimeter North-South Centralized characterized by North-South (NS) flows (between internal


security traffic security and external devices) that are protected by dedicated security
East-West
traffic
appliances (e.g., firewall) at the perimeter, see Fig. 25 (a).
However, in the past decade, the dynamics have been shifting
A A B
towards East-West (EW) flows (between internal data center
devices), accounting for up to 80% of total data center traffic
[113], [114]. Unlike North-South traffic, East-West traffic was
(a) (b)
relatively unprotected. A workaround for this was to use a
Distributed SW Distributed HW centralized security appliance and forward EW traffic to it for
security security
inspection6 , see Fig. 25 (b). This results in traffic traversing
the intermediary devices (i.e., switches) twice, leading to
duplication of both network load and the latency experienced
A B A B by the two hosts.
This has led to the emergence of Zero Trust and microseg-
(c) (d)
mentation architectures [115], whose main idea is to decentral-
ize security functionality and move it closer to the resources
Fig. 25. (a) Perimeter-based security. The appliance only inspects NS that require protection. Data centers and cloud providers have
traffic; (b) Centralized security. The appliance can inspect EW traffic, but shifted to using software-based security functions to protect
the bandwidth overhead is high; (c) Distributed SW firewall. Software-based
appliances are attached to the servers and can inspect EW traffic, but the
East-West traffic [116], see Fig. 25 (c). While this shift
performance is not high. (d) Distributed HW firewall. Appliances are offloaded is advantageous in terms of ease of deployment and cost-
to SmartNICs on the servers, enabling EW inspection with high performance. effectiveness, it has some drawbacks:
• Performance: Packets traverse the regular network stack to

hardware suppliers, and system solution providers. be processed by a security function on the general-purpose
CPUs. This increases latency and decreases the throughput.
• Scalability: The CPU cores often struggle to inspect traffic
VI. O FFLOADED A PPLICATIONS TAXONOMY
at high rates, particularly in the absence of software acceler-
This section describes the systematic methodology that was
ators (e.g., DPDK). This can lead to high packet drop rates.
adopted to generate the proposed taxonomy. The results of
• Isolation: all traffic, including malicious traffic, is sent to
this literature survey represent derived findings by thoroughly
the host. This lack of isolation can pose security risks.
exploring the SmartNIC-related research works published in
• CPU usage: security functions consume a substantial portion
the last five years.
of the CPU processing power, particularly during periods of
Fig. 24 shows the proposed taxonomy. The taxonomy was
high traffic volume. This can result in performance bottle-
meticulously designed to cover the most significant works
necks and service degradation for end-user applications.
related to SmartNICs. The aim is to categorize the surveyed
works based on various high-level disciplines. The taxonomy To mitigate these issues, SmartNICs have been used to
provides a clear separation of categories so that a reader offload the security functions from general-purpose CPUs, see
interested in a specific discipline can only read the works Fig. 25 (d). Specifically, SmartNICs have been used to offload
pertaining to that discipline. firewall functionalities, IDS/IPS, DPI, and data-at-motion and
SmartNICs accelerate various infrastructure applications, data-at-rest encryption.
categorized primarily into security, networking, and storage
functions. It also accelerates various computing workloads A. Firewall
including AI/ML inference and training, caching (key-value
A firewall monitors incoming and outgoing network traffic
stores), transaction processing, serverless functions, and oth-
and allows or blocks packets based on a set of precon-
ers. Each high-level category in the taxonomy is further
figured rules. Firewalls typically operate up to layer-4 to
divided into sub-categories. For instance, various transaction
perform basic ACL operations. This means that the traf-
processing works belong to the sub-category “Transaction
fic can be matched against network layer information (e.g.,
processing” under the high-level category “Compute”. Addi-
source/destination IP addresses) and transport layer informa-
tionally, the survey offers performance comparisons between
tion (e.g., source/destination port numbers).
applications running on the host and those offloaded to the
Software-based firewalls are widely being used, especially
SmartNICs.
in cloud environments [116]. They are typically implemented
The subsequent subsections delve into the ongoing devel-
in conjunction with a virtual switch (e.g., OvS). With software-
opments within each of the aforementioned category, offering
based firewalls, traffic is inspected using the CPU cores of
insights into the lessons learned from these advancements.
the host where the firewall is running. This degrades the
performance and consumes the compute capacity of the CPU.
VII. S ECURITY
Recall that SmartNICs are equipped with a programmable
The landscape of data center traffic has undergone a signifi- pipeline or an embedded switch, where match-action rules
cant transformation with the rise of cloud-hosted applications
and microservices [112]. Traditionally, traffic patterns were 6 Sometimes referred to as traffic tromboning.
Offload Type Throughput (Gbps)
Traditional (Software) 12 18
SmartNIC (Hardware) 0.3

CPU cores Host Throughput (Gbps) CPU cores on host


500 5
400 4
CPU cores
300 3
IDS/IPS API
200 2
100 1
NIC switch / 0 0
bypass Traditional SmartNIC Traditional SmartNIC
Programmable pipeline
(Software) (Hardware) (Software) (Hardware)
Port
SmartNIC
Slow path, without bypass
Throughput (Gbps) CPU cores on host
Traffic Fast path, with bypass
100 14
80 12
Fig. 26. IPS/IDS bypass offload to the SmartNIC. 10
60 8
40 6
4
20
can be defined. This makes it possible to implement firewalls 2
0 0
with basic ACLs directly on the hardware at line rate. The Traditional SmartNIC Traditional SmartNIC
functionality of the firewall can be implemented from scratch (Software) (Hardware) (Software) (Hardware)
by a developer. However, it requires implementing many
functions such as connection tracking if stateful inspection7 Fig. 27. Performance of IPS/IDS with hardware-based bypass (SmartNIC
is needed, flow caching and aging, etc. As an alternative, the offload) versus software-based bypass. The top row shows the results of
hardware offloaded switch on the SmartNICs is being used offloading the bypass of Suricata IDS while the bottom row shows the result of
offloading the bypass of Palo Alto’s NGFW. Both were offloaded to NVIDIA’s
to implement the firewall functionalities [117], [118]. The BlueField SmartNIC. Reproduced from [130].
switch rules can be transparently offloaded to the hardware.
The developer only needs to specify the rules that allow/block
traffic. The connection tracking feature of the switch can Current IDS/IPS systems incorporate bypass mechanisms
be leveraged to enable stateful inspection. As an example, within the software through the kernel datapath [129]. While
VMware allows offloading the firewall functionalities of its this enhances throughput, the process still depends on software
NSX distributed switch to the SmartNIC [18], specifically, the that utilizes CPU cycles to efficiently route packets directly to
L2-L4 inspection and firewalling. the user space. Consider Fig. 26. With an offloaded IDS/IPS,
the bypass function is implemented in the hardware without
B. Intrusion Detection/Prevention System requiring additional intervention from the IDS/IPS or the host
CPU [19], [130], [131]. The bypass is carried out on the
Intrusion Detection Systems (IDS) and Intrusion Prevention programmable pipeline or the embedded switch within the
Systems (IPS) are cybersecurity technologies designed to safe- SmartNIC. This significantly improves the performance.
guard networks and hosts from unauthorized access, malicious Fig. 27 shows the throughput (left) and the CPU cores
activities, and security threats. An IDS monitors and analyzes used on the host (right) of software-based IDS/IPS bypass
network or system events to identify suspicious patterns or and hardware-based (SmartNIC). The top row shows the
anomalies. It provides real-time alerts or logs for further results of offloading the bypass of Suricata IDS while the
investigation. On the other hand, an IPS goes a step further bottom row shows the result of offloading the bypass of
by actively preventing or blocking unauthorized activities in Palo Alto’s Next Generation Firewall (NGFW). Both were
real-time. Examples of open-source IDS/IPS include Zeek offloaded to NVIDIA’s BlueField SmartNIC. The results show
(formerly known as Bro) [119], Suricata [120], and Snort that hardware offloading (SmartNIC) can attain near line-rate
[121]. throughput (∼1200% better than the software in the case
IDS and IPS are generally deployed on the general-purpose of Suricata and ∼430% in the case of Palo Alto NGFW).
CPUs of the host. SmartNICs have been offloading IDS/IPS Moreover, the CPU cores on the host are almost idle all the
functions to accelerate data processing: time.
1) Offloading IDS/IPS bypass function: The IDS/IPS does 2) Offloading Deep Packet Inspection (DPI): IDS and IPS
not need to inspect every packet. Typically, the initial packets often require inspecting the payload within packets to identify
within a specific flow contain essential information, making potential malicious patterns. One example of this is URL
continuous inspection unnecessary. There are situations, such filtering, where the URL in the HTTP header of the packet
as when dealing with an elephant flow resulting from a large is matched against a database. This process is used for access
data transfer, where it is imperative not to inspect the packets control and blocking known harmful websites and phishing
throughout the flow’s lifetime. Additionally, encrypted traffic pages. The software-driven nature of IDS/IPS in performing
may also be exempt from inspection. URL filtering has notable implications on the performance.
7 Stateful inspection is a firewall technology that monitors and evaluates the
This is because, for every packet, the IDS/IPS must parse
state of active network connections, making decisions based on the context deep into the packet contents (DPI). In addition to URL
of the entire communication rather than individual packets. filtering, IDS/IPS systems apply DPI for various tasks such
19

TABLE X
C OMPARISON BETWEEN VARIOUS WORKS OFFLOADING IDS/IPS FUNCTIONS TO S MART NIC S .

Work Hardware used Targeted attack SmartNIC used Key points


Multi-string matching on FPGA; FPGA is used as a primary
Pigasus [122] FPGA, CPU General Intel Stratix engine; CPU is used as a secondary engine; Achieves
100Gbps using five CPU cores.
Offload rule pattern matching and traffic classification to
Fidas [123] FPGA General Xilinx FPGA; Achieves lower latency and higher throughput than
Pigasus.
FPGA-based traffic analysis; CPU-based threat detection
Zhao et al. [124] FPGA, CPU IoT traffic attacks N/A
using flow entropy algorithm.
P4 switches (ASIC) Netronome Coarse-grained traffic analysis on P4 switches; Finer-grained
SmartWatch [125] General
SmartNIC (CPU) Agilio analysis on SmartNIC.
ONLAD-IDS [126] CPU cores General BlueField Anomaly detection using ANOVA statistical method.
Natural Language Processing (NLP) for SQL query analysis;
Tasdemir et al. [127] CPU cores SQL attacks BlueField
ML classifiers for query classification.
Hardware and software-based packet filtering; Use cases
Miano et al. [128] ASIC, CPU DDoS N/A
target DDoS mitigation.

as application recognition (sometimes referred to as App-ID), cores of BlueField SmartNIC. The system uses the Analysis of
signature matching for malware, etc. Variance (ANOVA) statistical method for detecting anomalies.
SmartNICs are now integrating hardware-based RegEx en- Tasdemir et al. [127] implemented an SQL attack detection
gines. These engines perform pattern matching directly within system on the BlueField SmartNIC. The system uses NLP
the hardware, offering improved efficiency compared to tra- and ML classifiers to analyze and classify SQL queries.
ditional software-based approaches. Applications leveraging Miano et al. [128] implemented a DDoS mitigation system by
RegEx matching load a pre-compiled rule set into the engines combining hardware-based packet filtering on the SmartNIC
at runtime. This hardware-driven approach helps alleviate the and software-based packet filtering using XDP/eBPF.
performance concerns associated with DPI in IDS/IPS, making Table X summarises and compares the aforementioned
network security more robust and responsive. works that offload custom IDS/IPS functions to the Smart-
DPI has also been implemented on the hardware from NICs.
scratch (e.g., using an FPGA). Ceska et al. [132] proposed
an FPGA architecture for regular expression matching that can C. IPSec offload
process network traffic beyond 100Gbps. The system compiles
The Internet Protocol Security (IPSec) implements a suite
approximate Non-deterministic Finite Automata (NFAs) into a
of protocols to establish secure connections between end
multi-stage architecture. The system uses reduction techniques
devices. This is achieved through the encryption and authen-
to optimize the NFAs so that they can fit in the FPGA
tication of IP packets. IPsec comprises key modules includ-
resources. The system was implemented on Xilinx FPGA.
ing 1) key exchange, which facilitates the establishment of
Other works [133]–[136] have also explored optimizing NFAs
encryption and decryption keys through a mutual exchange
for FPGAs.
between connected devices; 2) authentication, which verifies
3) Offloading custom IPS/IDS functions: Zhao et al. [122]
the trustworthiness of each packet’s source; 3) encryption and
proposed Pigasus, an IDS that uses an FPGA to perform the
decryption, which encrypts/decrypts payload within packets
majority of the IDS functions, and a CPU to perform the
and potentially, based on the transport mode, the packet’s IP
secondary functions. Pigasus achieves 100Gbps with 100K+
header.
concurrent connections and 10K+ matching rules, on a single
IPSec has a data plane (DP) and a control plane (CP).
server. It requires on average five CPU cores and a single
The CP is responsible for the key exchange and session
FPGA-based SmartNIC. The system was tested using Intel
Stratix SmartNIC. Another FPGA-based solution proposed by
Chen et al. [123] is Fidas, which offloads rule pattern matching Plaintext packet IPSec encrypted packet
and traffic flow rate classification. Fidas achieves lower latency
and higher throughput than Pigasus. It was implemented on a Host Host

Xilinx FPGA. Zhao et al. [124] implemented an FPGA design Workload Workload
to analyze Internet of Things (IoT) traffic and summarize it
in real time. The CPU then uses a flow entropy algorithm to IPSec software
HW accelerators CPU
cores
detect the threats. (CP + DP) IPSec
OvS
Crypto
Panda et al. [125] proposed SmartWatch, a system that DP IPSec
Tunnel CP
combines P4 switches and SmartNICs to perform IDS/IPS Traditional
functions. The P4 switches perform coarse-grained traffic anal- NIC SmartNIC
ysis while the SmartNIC conducts the finer-grained analysis.
The SmartNIC used is Netronome Agilio. Wu et al. [126]
implemented an anomaly detection-based IDS on the CPU Fig. 28. IPSec on the host (a) and IPSec offloaded to the SmartNIC (b).
20

100 Line rate 15 TABLE XI


S ECURITY OFFLOADED SERVICE AND THE USED ACCELERATORS .
80
Throughput [Gbps]

Active CPU cores


10 Security Domain-specific accelerator
60
offloaded Symmetric Asym.
Match-action RegEx TRNG
40 service Crypto Crypto
5
Stateful
20 ✓
firewall
0 0 IDS/IPS ✓ ✓
1 8 512 64 128 512
Number of connections Number of connections
IPSec ✓ ✓ ✓
kTLS SW kTLS HW TCP kTLS SW kTLS HW TCP
TLS ✓ ✓ ✓
Storage
✓ ✓ ✓
encryption
Fig. 29. Throughput (a) and CPU core counts (b) of SW kTLS, HW kTLS,
and plaintext TCP.

number of CPU cores on the host when using HW kTLS is


establishment. The DP is used for encapsulating, encrypting, lower than the SW kTLS and the unencrypted TCP, regardless
and decrypting packets. Traditionally, the CP and DP of IPSec of the number of connections.
are executed fully in the host, see Fig. 28 (a). This consumes Kim et al. [140] explored offloading the TLS handshake
CPU cores, increases latency, and decreases throughput. Once to the SmartNIC. Their proof of concept on a BlueField
a packet is encrypted by the IPsec software, it is sent to the SmartNIC shows that there is a 5.9x throughput improvement
network over a traditional NIC. over executing the handshake on a single CPU core. Novais
The IPsec software can be offloaded to the SmartNIC to et al. [141] performed experimental evaluations to assess the
enhance security and performance, see Fig. 28 (b). The IPsec impact of TLS offload on a Chelsio SmartNIC. Their results
crypto operations (encryption/decryption) and encapsulation suggest that hardware offloading improves the throughput,
are executed by the hardware through the domain-specific latency, and power consumption. Zhao et al. [142] further
accelerators. The accelerators include symmetric cryptogra- detailed experiments on offloading TLS to SmartNICs. Their
phy algorithms (e.g., Advanced Encryption Standard (AES)), results suggest that SmartNICs can be beneficial for latency-
asymmetric cryptography (e.g., RSA, Diffie-Hellman), and sensitive tasks, but require caution with computationally heavy
a True Random Number Generator (TRNG). This deploy- loads.
ment model ensures transparency to the host, securing legacy
workloads while benefiting from the offloading capabilities of E. Data at Rest Encryption
IPsec.
Diamond et al. [137] measured the performance of IPsec SmartNICs can accelerate the encryption of data to be
encryption in hardware on the BlueField SmartNIC. The re- stored. Instead of using the CPU to encrypt the data, the
sults show that the offloaded IPSec is 10x faster than the fully domain-specific processor for encryption is used. Disk encryp-
software-based IPSec. Su et al. [138] evaluated IPSec using the tion protocol like AES-XTS 256/512-bit is used.
encryption accelerator on an FPGA SmartNIC. The offloaded The implementation of encryption offload for storage
IPsec attained ∼19x and ∼483x throughput improvement at through SmartNICs offers flexibility across various points in
64B and 1500B packet sizes, respectively. the storage data path. Encryption can occur directly on the
storage device (e.g., Just a Bunch of Flash (JBOF)), securing
data at rest. Alternatively, it may take place at the backend
D. TLS offload
of the storage controller, ensuring the encryption of data in
The prevalence of HTTPS servers using the TLS protocol transit. Another option involves encryption at the initiator, with
exceeds 80% across all web pages. As the demand for access- the initiator retaining control of the keys, and the encrypted
ing web servers continues to grow steadily, there is a need for data transmitted across the entire storage data path.
an increased rate of bandwidth.
TLS operates on layer 4, on top of TCP. The TLS process,
F. Summary and Lessons Learned
traditionally handled by user-space applications, has evolved
with the advent of offloading techniques. The kernel TLS SmartNICs significantly enhance the performance of se-
(kTLS) involves the offload of TLS operations into the kernel, curity inspection. Table XI summarizes the domain-specific
while Hardware (HW) kTLS offloads cryptographic functions accelerators used by the offloaded security functions. The key
to the domain-specific accelerators on the SmartNIC. With takeaways are:
HW kTLS, the TLS handshake and the error handling (e.g., • Offloading stateful firewall functions to the SmartNIC’s
incorrect sequence number) are performed in software, while embedded switch or programmable pipeline significantly
packets are encrypted and decrypted in hardware. boosts performance. The SmartNIC also allows running the
Consider Fig. 29 which shows the results of benchmarking firewall management and control planes on its CPU cores.
the performance of HW kTLS with an Nginx server, repro- • The performance of IDS/IPS can be enhanced when their
duced from [139]. The throughput of kTLS SW is smaller than bypass function is offloaded to the hardware. Additionally,
that of the HW kTLS when the number of connections is small. IDS/IPS can efficiently apply DPI using the RegEx hardware
The HW kTLS achieved line rate with eight connections. The embedded in the SmartNIC. This is used in various security
21

applications, including URL filtering, signature matching, Overlay


content filtering, etc.
• SmartNICs facilitate the encryption of data-in-flight without
compromising performance. Protocol stacks like IPSec or VM VM VM VM VM VM
TLS can be easily offloaded to the SmartNIC, without
vSwitch VTEP VTEP vSwitch
requiring much development. Also, the SmartNIC provides
APIs that enable developers to leverage the symmetric, Server Server
Underlay
asymmetric, and TRNG crypto processors.
• Data-at-rest can be encrypted by the SmartNIC, enabling Fig. 30. Tunneling.
faster storage encryption compared to the SW-based ap-
proach.
• The inherent programmability in SmartNICs opens avenues B. Tunneling and Overlay
for developers to implement custom, novel, and performant Tunneling is a technique that encapsulates and transports
security functionalities. one network protocol over another. This is commonly used
in virtualized environments to create isolated channels be-
tween VMs or between different segments of a virtualized
VIII. N ETWORK OFFLOADS network. Tunneling helps in overcoming the limitations of
the underlying physical network and enables the creation of
Software-defined networking (SDN) and NFV are transfor- virtual networks that can span across physical boundaries.
mative technologies that have revolutionized the way networks Various tunneling protocols are used in network virtualization,
are designed, deployed, and managed. Virtual switches play including Virtual Extensible LAN (VXLAN) [148], Geneve
crucial roles in enabling the flexibility, scalability, and effi- [149], Generic Routing Encapsulation (GRE) [150], etc. This
ciency that modern networks demand, especially to connect subsection will discuss VXLAN, but the idea generalizes
VMs. The networking functions implemented as NFVs on across all other tunneling protocols.
the server strain the CPU, especially in networks with high VXLAN establishes a virtual network (i.e., overlay net-
traffic rates. Recently, SmartNICs have been used to offload the work) over an existing layer-3 infrastructure (i.e., underlay
network functions from general-purpose CPUs. For instance, network) by creating tunnels between VMs. This overlay
SmartNICs have been used to offload switching/routing, tun- scheme enables the scalability of cloud-based services without
Offload Type Throughput (Gbps)
neling, measurement and telemetry, and others. the (Software)
Traditional necessity to add or reconfigure 40 the existing infrastructure.
However,
SmartNIC (Hardware) VXLAN introduces 390 an additional layer of packet
processing at the hypervisor level. Consider Fig. 30. Each
Offloadpacket leavingThroughput
the VM(Gbps)
must have an additional header to be
A. Switching Type
transported
Traditional (Software) over the underlay4 network. A VXLAN Tunnel End
Virtual switching emerged as a response to the need for
Point
SmartNIC (VTEP) device encapsulates
(Hardware) 0.1 during packet transmission
hypervisors to seamlessly connect VMs with the external
and decapsulates during packet reception. The VTEP is being
network [143]. Traditionally, virtual switches were running Offload Type Throughput (Gbps)
implemented on software as part of the hypervisor stack.
within the hypervisor, operating in software. However, this ap-Traditional (Software) 1
This
SmartNIC process incurs additional
(Hardware) 63.08 CPU overhead [152]. As the
proach proved to be CPU-intensive, impacting overall system
number of flows scales up, overloading the CPU with packets
performance and preventing optimal utilization of available
for encapsulation/decapsulation can easily lead to network
bandwidth [144], [145]. Offload Type Throughput (Gbps)
performance bottlenecks in terms of throughput and latency.
Software switches go beyond conventional layer-2 switching Traditional (Software)
SmartNICs can offload
1
tunneling functions from the host
SmartNIC (Hardware) 23.01
and layer-3 routing [146]; they facilitate rule matching on CPU [153]. They support inline encapsulation/decapsulation
various packet fields and support diverse actions on packets. of VXLAN and other tunneling protocols. This logic is im-
These actions include forwarding, dropping, marking, and
more.
1) Switching offload: SmartNICs, whether they use a NIC
Throughput (Mpps) Throughput (Mpps)
switch or a programmable pipeline, have lookups and ALUs 114B, 64 connections 114B, 250K connections
implemented in hardware. These components can be used to 80 25
implement the match-action functions required for switching 60 20
15
packets. Instead of re-implementing all the functions required 40
10
for switching, most SmartNICs allow offloading the datapath 20 5
of existing software switches, such as OvS [143], an open 0 0
source virtual switch. Note that it is possible to offload the Traditional SmartNIC Traditional SmartNIC
(Software) (Hardware) (Software) (Hardware)
datapath of proprietary switches, such as that of VMware’s
vSphere Distributed Switch [147]. Besides packet switching, (a) (b)

virtual switches can handle additional tasks such as Network Fig. 31. Throughput in million packets per second (Mpps) of software vs
Address Translation (NAT), tunneling, and QoS functionalities SmartNIC tunneling. (a) 114B packets and 64 connections; (b) 114B packets
such as rate limiting, policing, and scheduling. and 250,000 connections. Reproduced from [151].
22

Switch/router Switch/router TAP Switch/router Host Host

VM/Cont VM/Cont VM/Cont VM/Cont VM/Cont VM/Cont


# pkt
Port mirror
# bytes
# bits per sec
(a) (b) (c) ... Software switch NIC switch /
Programmable packet processor
Original packet Copied packet Flow record (e.g., NetFlow)
Traditional NIC SmartNIC

Fig. 32. (a) Port mirroring; (b) TAP; (c) NetFlow export. (a) (b)

Fig. 33. VM and containers observability with (a) software switches and (b)
SmartNICs.
plemented in the embedded NIC switch [154] or the pro-
grammable pipeline [55]. Tunnels definition, which is part of
the control plane, is implemented in software, on the CPU example, the SmartNIC can provide telemetry data containing
cores of the SmartNIC. This design not only improves through- the CPU, memory, and disk usage of the host.
put and reduces latency for the encapsulation/decapsulation 3) VM and Containers Observability with SmartNICs: Ex-
operations, but also frees up CPU cycles on the host for other ternal approaches to packet observability cannot observe inter-
tasks. Fig. 31 compares the tunneling performance between the VM/container traffic within the same server. While software-
software and a BlueField SmartNIC, reproduced from [151]. based approaches for monitoring VMs and containers exist, see
With 114-byte packets and 64 connections, the SmartNIC Fig. 33 (a), they often burden the CPU, especially with high
tunneling is ∼60 times higher than the software-based. With traffic rates [163]. SmartNICs provide hardware visibility on
114-byte packets and 250,000 connections, the SmartNIC traffic between VMs or containers within the same server (see
tunneling is ∼20 times higher than the software-based. Fig. 33 (b)), alleviating the CPU burden on the host.

Plaintext
C. Observability - Monitoring and Telemetry
packet
Host D. Load Balancing
V V V
Observability is the ability to collect
M M and
extract telemetry
M Load balancers play a crucial role in modern cloud envi-
information. During a network outage, effective observability ronments by distributing network requests across servers in
facilitates diagnosing and troubleshooting problems. It can also
Software switch
data centers efficiently. Traditionally, load balancers relied on
help in detecting malicious events Traditional
and identifying network specialized hardware, but now software-based solutions are
NIC
performance bottlenecks. prevalent among cloud providers. This shift offers flexibility
Traditional packet observability solutions are typically im- and allows for on-demand provisioning on standard servers,
plemented in hardware, situated outside the server. Examples though it comes with higher provisioning and operational
include configuring port mirroring (e.g., Switched Port Ana- expenses. While software-based load balancers offer greater
lyzer (SPAN)) on switches/routers, see Fig. 32 (a), deploying customization and adaptability compared to hardware-based
network TAPs for replicating packets, see Fig. 32 (b), and counterparts, they also entail considerable costs for cloud
exporting flow-based statistics using NetFlow [155] or IPFIX providers due to server purchase expenses and increased
[156] to a remote collector, see Fig. 32 (c). energy consumption.
1) Offloading Packets Observability to SmartNICs: The Load balancers are categorized into two main types: Layer
traditional approaches to packet observability are all supported 4 (L4) and Layer 7 (L7). L4 load balancers function at
by SmartNICs. SmartNICs can mirror packets and send them the transport layer of the network stack. They associate a
to remote collectors. They can also export telemetry using Virtual IP address (VIP) with a list of backend servers, each
flow-based telemetry solutions like NetFlow or IPFIX, or using having its own dynamic IP (DIP) address. Routing decisions
packet-level telemetry streaming such as In-band Network made by L4 load balancers are based solely on the packet
Telemetry (INT) [157] and In-situ OAM [158]. SmartNICs headers of the transport/IP layers, considering factors such as
can also monitor and aggregate telemetry locally, which avoids source and destination IP addresses and ports. Thus, L4 load
excessive traffic exports. Furthermore, since they incorporate balancers do not inspect the payload content of the packets.
programmable pipelines, they can be used to implement more On the other hand, L7 load balancers operate at a higher
complex packet telemetry than the traditional ones. For exam- layer, specifically the application layer. These balancers are
ple, it is possible to implement streaming algorithms such as more intricate, as they analyze content within the packets,
the Count-min Sketch (CMS) [159] to estimate the number of particularly focusing on application-layer protocols like HTTP.
packets per flow in a scalable way, or a Bloom Filter [160] The L7 load balancer directs incoming requests to appropriate
to test the occurrence of an element in a set. Such telemetry backend servers based on the specific service being accessed.
information can be very useful for a variety of applications For instance, differentiation may occur based on URLs.
(e.g., security [161], performance analysis [162], etc.). 1) Offloading Load Balancing to SmartNICs: Several
2) Offloading System Observability to SmartNICs: The works have offloaded load balancing to SmartNICs. Cui
SmartNIC also offers supplementary telemetry data related to et al. [164] proposed Laconic, a system that improves the
the system in which it is located [163], such as the host. For performance of load balancing due to three key points: 1)
23

Header fields
Control plane
Received packet Data
N3 User plane N6
Network
function (UPF)
Hash 1 User Equipment Radio Access
function Indirection (UE) Network (RAN)
table
2 Fig. 35. 5G network architecture. The packet core is implemented as VNF
LSB
on general-purpose CPUs. The SmartNIC is being used to offload the UPF
Hash value functions.
3
1
on general-purpose CPUs rather than dedicated appliances.
N
General-purpose CPUs are not capable of guaranteeing high
throughput and low latency, which are the requirements and
Fig. 34. Receive Side Scaling (RSS). the Key Performance Indicators (KPI) of 5G networks.
1) UPF offload to SmartNIC: The SmartNIC can be used
to offload the UPF functions [173]. Specifically, the following
1) Lightweight network stack: unlike traditional L7 load balancers, which heavily rely on the operating system's TCP stack, Laconic opts for a lighter packet forwarding stack on the load balancer itself. This approach minimizes overhead and leverages the end-hosts to achieve the desired end-to-end properties; 2) Lightweight synchronization for shared data structures: Laconic implements a concurrent connection table design based on the cuckoo hash table. This design efficiently manages hash conflicts and reduces the number of entries needing probing during lookups; 3) Acceleration with hardware engines: Laconic optimizes packet processing by transferring common packet rewriting tasks to hardware accelerators. This strategy alleviates the processing burden on the CPU cores of the SmartNIC. Huang et al. [165] offloaded the load balancer to an FPGA-based SmartNIC. The results show that the system was able to load-balance at 100 Gbps. Chang et al. [166] described a scheme that finds an optimal load balancing strategy for a network topology. It uses SmartNICs and programmable switches. Other works [167]–[170] used variations of the methods above for load balancing.
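As a simplified illustration of the datapath logic of an L4 load balancer, the sketch below consistently maps each connection's five-tuple to a backend server and pins the choice in a connection table so that packets of an established flow keep reaching the same backend. The backend pool and the use of CRC32 are illustrative assumptions; systems such as Laconic [164] additionally deal with concurrent table updates and hardware offload.

import zlib

BACKENDS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]   # assumed backend pool
connections = {}                                      # five-tuple -> backend

def forward(five_tuple):
    """Return the backend for a flow; existing flows keep their backend."""
    backend = connections.get(five_tuple)
    if backend is None:                               # new connection
        key = "|".join(map(str, five_tuple)).encode()
        backend = BACKENDS[zlib.crc32(key) % len(BACKENDS)]
        connections[five_tuple] = backend             # pin the flow
    return backend

flow = ("198.51.100.7", "203.0.113.5", 6, 51000, 80)
print(forward(flow))        # backend chosen once for the flow ...
print(forward(flow))        # ... and reused for subsequent packets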
2) Receive Side Scaling (RSS): SmartNICs commonly include an accelerator for RSS, which is a mechanism to distribute incoming network traffic across multiple CPU cores. To achieve this, the SmartNIC calculates a hash value (Toeplitz hash [171]) based on header fields (such as the five-tuple) of the received network packet, see Fig. 34. The hash value's Least Significant Bits (LSBs) are then used as indices for an indirection table, the values of which are used to allocate the incoming data to a specific CPU core. Some SmartNICs allow steering packets to queues based on programmable filters [172].
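The indirection mechanism of Fig. 34 can be sketched as follows (Python, for exposition only). CRC32 is used here as a stand-in for the Toeplitz hash, and the table size and core count are illustrative assumptions; real NICs compute the hash and table lookup in hardware.

import zlib

NUM_CORES = 8
# Indirection table: 128 entries mapping hash LSBs to CPU core IDs
# (populated round-robin here; drivers may rebalance it at runtime).
INDIRECTION_TABLE = [i % NUM_CORES for i in range(128)]

def rss_select_core(src_ip, dst_ip, proto, src_port, dst_port):
    """Pick the CPU core/queue for a packet from its five-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    h = zlib.crc32(key)                          # stand-in for the Toeplitz hash
    index = h & (len(INDIRECTION_TABLE) - 1)     # hash LSBs -> table index
    return INDIRECTION_TABLE[index]

print(rss_select_core("10.0.0.1", "10.0.0.2", 6, 49152, 443))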
E. 5G UPF

The User Plane Function (UPF) in 5G networks represents the data plane within the packet core. It connects the User Equipment (UE) from the Radio Access Network (RAN) to the data network, see Fig. 35. The UPF typically performs packet inspection, routing and forwarding, and QoS enforcement. It processes millions of flows with a high connection rate. 5G networks implement the packet core as VNF running on general-purpose CPUs rather than dedicated appliances. General-purpose CPUs are not capable of guaranteeing high throughput and low latency, which are the requirements and the Key Performance Indicators (KPI) of 5G networks.

1) UPF offload to SmartNIC: The SmartNIC can be used to offload the UPF functions [173]. Specifically, the following functions are offloaded: GTPU tunneling: the encapsulation and decapsulation of packets run at line rate; policing: the SmartNIC controls the bit rates of the devices so that they do not exceed the Maximum Bit Rate (MBR); statistics: the counters and metrics are calculated and used for billing purposes; QoS: the SmartNIC performs Differentiated Services Code Point (DSCP) marking on flows to enable 5G QoS; load balancing: the SmartNIC balances the traffic to the corresponding application; Network Address Translation (NAT): the SmartNIC translates IP addresses on traffic; etc.

Offloading the UPF will not only improve throughput and reduce latency, but it will also boost the number of users per server (7x according to [173]) and lower the Capital Expenditure (CapEx) per user.
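To illustrate the GTPU tunneling operation mentioned above, the sketch below encapsulates and decapsulates a user packet with the 8-byte mandatory GTPv1-U header; the outer IP/UDP headers that carry the tunnel (UDP port 2152) are omitted. The TEID and inner packet are placeholder values, and a line-rate deployment would perform this in the SmartNIC pipeline rather than in host software.

import struct

def gtpu_encapsulate(teid: int, inner_packet: bytes) -> bytes:
    """Prepend a minimal GTPv1-U header: flags, message type 0xFF (G-PDU),
    payload length, and the Tunnel Endpoint Identifier (TEID)."""
    flags, msg_type = 0x30, 0xFF            # version 1, protocol type GTP
    header = struct.pack("!BBHI", flags, msg_type, len(inner_packet), teid)
    return header + inner_packet

def gtpu_decapsulate(outer: bytes) -> bytes:
    length = struct.unpack("!H", outer[2:4])[0]
    return outer[8:8 + length]

inner = b"inner user-plane IP packet"        # placeholder inner packet
tunneled = gtpu_encapsulate(teid=0x1234, inner_packet=inner)
assert gtpu_decapsulate(tunneled) == inner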
F. Summary and Lessons Learned

SmartNICs significantly improve the performance of network functions and reduce their CPU consumption on the hosts. The key takeaways are:
• The packet switching functions (i.e., matching header fields and taking actions) can be accelerated with SmartNICs. This is because SmartNICs, whether they use a NIC switch or a programmable pipeline, have lookups and ALUs implemented in hardware.
• The performance of tunneling operations (encapsulation/decapsulation) can be significantly improved when offloaded to the SmartNIC. This also frees the CPU cores that were previously used for performing the tunneling operations.
• SmartNICs not only support traditional telemetry solutions but also allow the developer to devise custom fine-grained measurement schemes. They also enable inter-VM/container packet observability and host metrics telemetry.
• Offloading the UPF of 5G improves the performance of packet processing, increases the number of users per server, and decreases the per-user CAPEX.
• Instead of re-implementing all the switching functions, SmartNICs allow offloading the datapath of existing software switches.
• Developers can devise custom packet processing algorithms not supported by existing software switches.
Fig. 36. (a) NVMe-oF initiator without offload. (b) NVMe-oF initiator offloaded to the SmartNIC.

Fig. 37. (a) NVMe-oF target without offload. (b) NVMe-oF target with offload. Reproduced from [174].

IX. S TORAGE
host. Requests from applications are simply forwarded to a
Traditionally, storage devices were directly attached to indi-
lightweight NVMe driver on the host. The initiator stack on
vidual computers or servers. This method provided fast access
the SmartNIC leverages the hardware accelerators for tasks
to data but lacked scalability and centralized management.
like inline cryptography and CRC offloading. The TCP stack
Network Attached Storage (NAS) emerged as a solution to
can either remain on the CPU cores of the SmartNIC or be
these limitations. It involves connecting storage devices to
offloaded to the hardware itself, depending on performance
a network, allowing multiple users and clients to access the
considerations and SmartNIC capabilities. The division of
storage resources over the network. NAS provided file-level
NVMe-oF functions between hardware and software allows for
access to data. Storage Area Network (SAN) provides a
optimization based on performance and SmartNIC capabilities.
high-speed network that connects storage devices to servers,
Another offload to the SmartNIC is the NVMe-oF RDMA.
providing block-level access to storage resources. SANs offer
The NVMe/RDMA data path is implemented in the hardware,
higher performance and scalability compared to NAS.
with inline cryptography and CRC offloaded. This approach
Traditional remote storage mechanisms establish a connec-
offers a high-performance, low-latency solution.
tion between a local host initiator and a remote target. This
process heavily burdens the host CPU, leading to a significant
decrease in overall performance. SmartNICs can be used to B. NVMe-oF Target
offload the processing from the host CPU. Another offload opportunity is offloading the storage tar-
get functions. On a storage target such as JBOF supporting
NVMe-oF, there is a CPU positioned between the network
A. NVMe-oF Initiator and NVME SSDs, see Fig. 37 (a). This CPU runs software
Non-Volatile Memory Express (NVMe) is an interface responsible for converting NVME-over-Fabrics Ethernet or
specification for accessing a computer’s non-volatile storage InfiniBand signals to NVME PCIe signals. The software com-
media usually attached via the PCI Express bus. It is typi- prises various components, including a network adapter stack,
cally used for accessing high-speed storage devices like Solid NVME-over-Fabrics stack, operating system block layer, and
State Drives (SSDs). NVMe over Fabrics (NVMe-oF) extends NVME SSD stack. Both the network adapter and SSD utilize
NVME to operate over network fabrics such as Ethernet, queues and memory buffers to interface with different software
Fibre Channel, or InfiniBand. The NVMe initiator initiates stacks.
and manages communication with NVMe targets. It sends When a request originates from the network, it arrives at
commands to NVMe targets to read, write, or perform other the network adapter as an RDMA SEND with the NVME
operations. The NVMe target refers to the NVMe storage command encapsulated. The adapter then forwards it to its
device itself. driver on the target CPU, which further passes it to the NVMe-
Fig. 36 (a) shows the traditional method of NVMe-oF using oF target driver. The NVME command proceeds through the
the TCP protocol and a regular NIC. The entire NVMe-oF driver for the SSDs and then to the NVME SSD controller.
initiator software stack operates on the host. Tasks such as Subsequently, the response follows the reverse path through
cryptography and CRC computations further strain the host the software layers.
CPU and memory bandwidth. 1) NVMe-oF Target Offload: With the offload, the fast path
1) NVMe-oF Initiator Offload: The NVMe-oF initiator is shifted to the hardware on the SmartNIC. Instead of bur-
functionality can be offloaded to the SmartNIC (Fig. 36 (b)), dening CPU cycles with millions of Input/Output Operations
minimizing the overhead on the host. The SmartNIC exposes per Second (IOPS), the adaptor now handles the load using
a high-performance PCIe interface and NVMe interface to the specialized function hardware. Software stacks remain in place
Fig. 38. CPU utilization during compression with various algorithms (DEFLATE, zlib, SZ3) on seven datasets. Reproduced from [175].

Fig. 39. Compression time with various algorithms (DEFLATE, zlib, SZ3) on seven datasets. Reproduced from [175].
Software stacks remain in place for management traffic. The reduction in latency by removing the software from the data path is by a factor of three [174]. Moreover, the CPU usage with offload is negligible.

C. Compression and Decompression

The surge in data volumes has caused performance bottlenecks for storage applications. Data compression is a widely adopted technique that mitigates this bottleneck by reducing the data size. It encodes information using fewer bits than the original representation. Notably, machine learning, databases, and network communication rely on compression techniques, both lossless and lossy, to enhance their performance. Data compression is compute-intensive and time-consuming, especially with large sizes of data to be compressed.

1) Offloading Compression to SmartNICs: SmartNICs include onboard hardware accelerators that enable the offloading of compression and decompression tasks from host CPUs. This offloading alleviates the strain on host resources, resulting in savings and improved performance. Fig. 38 shows the CPU utilization when compression is executed entirely on the host (denoted as CPU) versus when executed on the compression hardware engine of the SmartNIC (denoted as SmartNIC). The experiment shows results for various compression algorithms (e.g., DEFLATE [176], zlib [177], SZ3 [178]) over seven datasets. The datasets are sorted in the figure by their sizes in ascending order; each dataset is a column in the figure. The experiment is reproduced from [175]. When the compression is executed entirely on the host, the CPU usage approaches 100%, especially with large datasets. With a SmartNIC, there is a significant reduction in the CPU utilization.

Fig. 39 shows the compression time needed when executed entirely on the host (denoted as CPU) versus when executed on the compression hardware engine of the SmartNIC. The experiment is reproduced from [175]. With a SmartNIC, there is a significant reduction in the compression time, regardless of the size of the dataset.
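As a point of reference for the host-only (CPU) baseline, the following sketch measures DEFLATE compression time and ratio with Python's zlib module. It is illustrative only; it does not reproduce the datasets of [175], and an offloaded run would submit the same buffer to the SmartNIC's compression engine instead of calling the host library.

import time
import zlib

def compress_on_host(data: bytes, level: int = 6):
    """DEFLATE-compress a buffer on the host CPU and report timing."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(data) / len(compressed)
    return elapsed_ms, ratio

payload = b"example payload " * 1_000_000   # ~16 MB of compressible data
elapsed_ms, ratio = compress_on_host(payload)
print(f"compressed in {elapsed_ms:.1f} ms, ratio {ratio:.1f}x")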
D. Summary and Lessons Learned

Offloading storage functions to the SmartNICs improves the performance.
• Due to the hardware accelerators in the SmartNIC (e.g., compression, crypto), storage operations like compression, deduplication, and crypto will run faster than on the host's CPU.
• The SmartNIC can be deployed on the initiator or the storage target. In both deployments, the CPU usage on the hosting device is negligible, the latency is minimized, and the number of IOPS is improved.

X. COMPUTE

This section examines applications offloaded to the SmartNIC that are not specifically tailored to infrastructure functions. Instead, these applications leverage the SmartNIC for accelerated computing tasks.

A. Machine Learning

State-of-the-art deep ML models have significantly expanded in size, playing a critical role in various domains, including computer vision, Natural Language Processing (NLP), and others [47]. The scale of these models has seen a dramatic increase, with the number of parameters growing from 94 million in 2018 [179] to 174 trillion in 2022 [180]. This exponential growth owes much to advancements in parallel and distributed computing, enabling tasks related to model training to be distributed across multiple computing resources simultaneously. The practice of offloading parts of ML tasks to network resources traces back to the 2000s [181], a trend that continued with the advent of Software Defined Networking (SDN), where ML primarily operates within the control plane [182]. The recent emergence of programmable data planes (i.e., programmable switches, SmartNICs) has further spurred research and practical applications toward offloading ML phases, such as training and inference, to the hardware. Offloading ML tasks can occur on a single network device or across multiple devices, depending on network requirements and the complexity of the offloaded ML task.

1) ML training: The training of large ML models can be accelerated by following a distributed approach. This involves computing gradients on each device based on a subset of the data, which are then aggregated to update model parameters. Additionally, optimization of model parameters can be carried out in the data plane to maximize accuracy.
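A minimal illustration of this data-parallel pattern is sketched below in plain Python. Only the host-side logic is shown: each worker computes a gradient on its data shard, the gradients are averaged (the step that systems such as [183], [184] offload to the SmartNIC/FPGA), and the parameters are updated. The toy loss, learning rate, and shard values are illustrative assumptions.

def local_gradient(worker_data, params):
    """Toy gradient of a squared-error loss for one worker's data shard."""
    return [2 * (p - x) for p, x in zip(params, worker_data)]

def aggregate(gradients):
    """Average gradients from all workers (the step offloaded to the NIC)."""
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]

params = [0.0, 0.0, 0.0]
shards = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]              # one shard per worker
for _ in range(100):                                     # training iterations
    grads = [local_gradient(shard, params) for shard in shards]
    avg = aggregate(grads)
    params = [p - 0.1 * g for p, g in zip(params, avg)]  # SGD update
print([round(p, 2) for p in params])                     # -> [2.0, 2.0, 2.0]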

Fig. 40. High-level architecture of [183].

Itsubo et al. [183] devised a system that employs in-network gradient aggregation and parameter optimization for neural networks using an FPGA-based SmartNIC. Fig. 40 shows the high-level architecture of [183]. Gradient computation occurs on GPUs and is subsequently transmitted to the FPGA via PCIe, where aggregation takes place. Following aggregation, the FPGA can execute parameter optimization algorithms, including Stochastic Gradient Descent, Adagrad, Adam, and SMORMS3. The proposed framework achieves aggregation at 98.5% line rate and accelerates parameter optimization by ≈ 1.2 times. Tanaka et al. [184] opt for a different approach to aggregation, employing a ring network topology of FPGA-based SmartNICs using the allreduce algorithm [185]. In this setup, each node within the ring network awaits data from the preceding node, aggregates it upon reception using the all-reduce aggregation algorithm, and subsequently forwards it to the succeeding node (as illustrated in Fig. 41). This strategy not only alleviates CPU load but also establishes a direct memory link between the GPU and the FPGA, thus preserving accuracy, as floating-point operations required by the algorithm are supported at the hardware level. Similarly, Ma et al. [186] improve distributed ML training by performing the entirety of allreduce operations on an FPGA-based SmartNIC in a ring network topology. The proposed approach compresses the gradients before they are shared with other nodes, thus reducing bandwidth usage. The aggregation is performed on the SmartNIC of the end-host nodes.

Fig. 41. High-level architecture of [183].

2) ML Inference: In the inference phase, various ML models such as decision trees, neural networks, and reinforcement learning algorithms undergo training on a general-purpose CPU. Once trained, these models are translated into rules that can be executed within the data plane of the device. This approach enables accelerated inference, enhancing the efficiency of real-time decision-making processes.

IIsy [187] explores the feasibility of deploying different classification algorithms on programmable data planes. In particular, IIsy can implement decision trees, K-means, Support Vector Machine (SVM), and Naïve Bayes to perform per-packet classification. The framework converts the code into match-action tables compatible with programmable data planes. IIsy's prototype is implemented over an FPGA-based SmartNIC using P4 [188]. Xavier et al. [189] developed a framework that translates decision trees into a P4-programmable data plane using an if-else chain of statements. The proposed framework differs from [187] by introducing per-flow classification. BaNaNa Split [190] accelerates neural networks inside programmable switches and SmartNICs. This approach leverages the layered structure of neural networks by splitting them between the CPU and the network processor. However, BaNaNa Split necessitates quantization, a process that reduces the precision of neural network weights at the cost of diminishing accuracy.
the preceding node, aggregates it upon reception using the
all-reduce aggregation algorithm, and subsequently forwards B. Key-value Stores
it to the succeeding node (as illustrated in Fig. 41). This
strategy not only alleviates CPU load but also establishes a Data centers face a growing demand for collecting and
direct memory link between the GPU and the FPGA, thus pre- analyzing vast amounts of data. Typically, this data is stored
in key-value stores due to their superior performance over tra-
ditional relational database systems. Popular key-value stores
include Redis [191] and Memcached [192]. As data volumes
+ increase, so does the frequency of reads and writes to these
SRAM stores, leading to bottlenecks in the traditional network proto-
col stack and heavy CPU consumption.
DMA The emergence of SmartNICs offers a solution by offload-
FPGA ing key-value store operations to accelerate performance and
reduce CPU load. One effective method is leveraging RDMA
GPU
[193]–[197]. RDMA allows data to be read from or written
CPU to memory without involving the operating system and the
traditional network stack. This results in lower latency, reduced
+ +
CPU overhead, and higher bandwidth compared to traditional
SRAM SRAM
networking approaches.
Sun et al. [198] implemented SKV, a distributed key-value
DMA DMA
store accelerated with SmartNIC. The system offloads the data
FPGA FPGA replication and failure detection components. It targets the
GPU GPU Redis key-value store and is implemented using the BlueField
CPU CPU SmartNIC. The evaluations show that the system reduces
the latency by 21% and increases the throughput by 14%
compared to being implemented fully on the host without
Fig. 41. High-level architecture of [183]. SmartNIC acceleration.
27

Another aspect of the key-value store that was offloaded to the SmartNIC is the ordering of elements. Ordered key-value stores enable additional applications by allowing an efficient SCAN operation. Liu et al. [199] proposed Honeycomb, an FPGA-based system that provides hardware acceleration for an in-memory ordered key-value store. It focuses on read-dominated workloads. Consider Fig. 42. The B-Tree accelerator implements the GET and SCAN operations. The CPU executes the PUT, UPDATE, and DELETE operations. The B-Tree is stored on the onboard DRAM of the FPGA and on the memory of the host. Storing the B-Tree on the host allows better scalability since its memory is larger than that of the FPGA. The memory subsystem maintains a cache and communicates with the onboard DRAM. It also communicates with the host memory using PCIe. The implementation shows that the system increases the throughput of another ordered key-value store [200] by 1.8x.

Fig. 42. Honeycomb system architecture [199].

Chen et al. [201] designed a heterogeneous key-value store where a primary instance runs on the host and a secondary instance runs on a SmartNIC. The system identifies the popular items and replicates them to the SmartNIC. The popular items are identified with moving window access counters. The server instance serves the read and write requests of all keys while the SmartNIC instance serves only the read requests of popular items. This system targets read-intensive workloads with skewed access. The system was implemented on a BlueField-2, and the results show that the throughput is improved by 1.86x compared to a standalone RDMA key-value store.
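The hot-key replication idea can be illustrated with the following minimal sketch: reads of popular keys are eventually served by a NIC-side read-only replica, while all writes (and cold reads) go to the host instance. The popularity threshold and simple counter are illustrative simplifications of the moving-window counters of [201], and cache invalidation on writes is omitted for brevity.

class HostStore:
    """Primary key-value instance on the host: serves all reads and writes."""
    def __init__(self):
        self.data = {}
    def put(self, k, v): self.data[k] = v
    def get(self, k):    return self.data.get(k)

class NicReplica:
    """Secondary instance on the SmartNIC: read-only cache of popular keys."""
    def __init__(self, host, hot_threshold=3):
        self.host, self.cache, self.hits = host, {}, {}
        self.hot_threshold = hot_threshold
    def get(self, k):
        if k in self.cache:                      # served from the NIC
            return self.cache[k]
        self.hits[k] = self.hits.get(k, 0) + 1   # track popularity
        if self.hits[k] >= self.hot_threshold:   # replicate hot key
            self.cache[k] = self.host.get(k)
        return self.host.get(k)                  # fall back to the host

host = HostStore()
host.put("user:1", "alice")
nic = NicReplica(host)
for _ in range(5):
    nic.get("user:1")          # after 3 misses, reads become NIC-local
print("user:1" in nic.cache)   # -> True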
C. Transaction Processing

High-performance transaction processing is important to enable various distributed applications. These systems need to manage a large number of requests from the network efficiently. One crucial aspect is determining how to schedule each transaction request to the most suitable CPU core. Consider Fig. 43 (a), which shows the architecture of a transaction processing system without scheduling. A traditional NIC receives requests from the clients and dispatches them to the worker threads. The worker threads then execute the transaction, while considering the contention issues that might happen. Contention in this context means that two workers are accessing the same data and at least one of them is issuing a write. In Fig. 43 (a), two transactions (txn0 and txn1) are writing to the same data blocks A and C. In such a scenario, the transactions are typically aborted, causing the clients to resend the transactions, which degrades the performance.

Fig. 43. Transaction processing systems. (a) A system without scheduling; (b) scheduling using a SmartNIC. Reproduced from [202].

Li et al. [202] proposed using a SmartNIC to schedule the transactions to the appropriate worker threads. The SmartNIC maintains the runtime states, giving it the flexibility to make accurate scheduling decisions. The SmartNIC queues the transactions belonging to the same worker thread. This avoids having the clients resend the transactions. The system is implemented on an FPGA-based SmartNIC, which further reduces the scheduling overhead. The system was implemented over the Innova-2 SmartNIC, and the results show that the throughput is boosted by 2.68x and the latency is reduced by 48.8% compared to CPU-based scheduling.

Schuh et al. [203] implemented Xenic, a SmartNIC-based system that applies an asynchronous, aggregated execution model to maximize network and core efficiency. It uses a data store on both the SmartNIC and the host. This data store provides fast access to host data via indexing. It also maintains state to mitigate concurrency and contention issues. Xenic also aggregates work at all inputs and outputs of the SmartNIC to achieve communication efficiency. The system was implemented on a LiquidIO SmartNIC. The results show that Xenic improves the throughput of prior RDMA-based systems by approximately 2x, reduces the latency by up to 59%, and saves server threads.
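The scheduling intuition, conflicting transactions should land on the same worker so they serialize instead of aborting, can be sketched as follows. This is a simplified stand-in for the NIC-side logic: it partitions by a static hash of a representative key, whereas [202] bases its decisions on runtime state maintained on the SmartNIC.

from collections import defaultdict

NUM_WORKERS = 4
queues = defaultdict(list)          # one queue per worker thread

def schedule(txn):
    """Send transactions touching the same record to the same worker,
    so conflicting writes serialize instead of being aborted."""
    key = min(txn["write_set"])     # deterministic representative key
    worker = hash(key) % NUM_WORKERS
    queues[worker].append(txn)
    return worker

print(schedule({"id": 0, "write_set": {"A", "C"}}))   # txn0
print(schedule({"id": 1, "write_set": {"A", "C"}}))   # txn1 -> same worker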
D. Serverless Computing

Fig. 44. Architectures used in the cloud. Reproduced from [204].

Figure 44 shows the architectures used in the cloud today. The server virtualization allows guest operating systems to run on top of a host operating system. The applications and the libraries run on top of the guest OS and are isolated from other operating systems. The trend has shifted towards container
virtualization where applications and their libraries are isolated but they share the same OS. The containers can be connected through software switches to enable their communications.

The complexity and the scale of the cloud make it hard to manage and provision the infrastructure with tasks requiring fine-grained allocation of resources under changing workload demands. This has led to the serverless compute architecture, also known as Function as a Service (FaaS). In a serverless architecture, developers write code that represents functions (also known as Lambdas), and these functions are triggered by various events. The users are billed only for the resources consumed during the execution. The serverless workloads, which are targeted by these functions, are typically short-lived with strict computing and memory limits. The cloud provider will manage the infrastructure by creating containers and taking them down when the workload is completed. Examples of serverless computing frameworks include Amazon Lambda [205], Google Cloud Functions [206], and Microsoft Azure Functions [207].

Running the serverless computing functions on top of containers that are running on top of an OS incurs processing and networking overhead that increases the latency. Recently, cloud providers have been using the isolate functions architecture in which the functions are executed on a bare metal server.

1) Executing Lambda Functions on SmartNICs: Recent efforts have explored the potential of executing Lambda functions on the SmartNICs. Choi et al. [204] proposed λ-NIC, a framework where Lambda functions are executed on the SmartNIC. It provides a programming abstraction, which resembles the match-action of the P4 language, to express the lambda functions. The framework analyzes the memory accesses of the functions to map them across the memory hierarchy of the SmartNIC. Because the workloads are short-lived, λ-NIC assigns a function to a single core on the SmartNIC. The system was implemented on a Netronome Agilio CX, and the results show that λ-NIC can decrease the average latency by 880x and improve the throughput by 736x.

Tootaghaj et al. [22] proposed SpikeOffload, a system that offloads serverless functions to the CPU cores of the SmartNICs in the presence of transient traffic spikes, see Fig. 45. A workload collector module gathers the history of workloads and feeds the summary to the workload manager module. The workload manager module predicts the workload spikes based on the service time and the CPU loads of the servers and the SmartNICs. It then configures the service gateway (GW) to distribute the requests to the corresponding device (i.e., servers and SmartNICs) in the compute plane. SpikeOffload predicts the spikes in the workloads using ML. It starts the containers before the actual load arrives to mitigate the containers' cold start latency. The system was implemented on a BlueField-2, and the results show that the Service Level Agreement (SLA) violations for certain workloads can be reduced by up to 20%.

Fig. 45. SpikeOffload system architecture. Reproduced from [22].

E. Summary and Lessons Learned

SmartNICs extend their utility beyond infrastructure-related tasks, accelerating various compute functions. The key takeaways include:
• Machine learning tasks, encompassing distributed training and inference, experience significant performance enhancements when offloaded to SmartNICs. These devices efficiently aggregate model updates from multiple ML workers and optimize model parameters. Their programmable pipeline also enables the execution of certain ML models directly for line-rate inference.
• Key-value store operations, which include retrieving and updating data, replicating stores, and detecting failures, can be offloaded to SmartNICs. This would bring notable throughput and latency improvements.
• SmartNICs can be used to schedule transactions, aggregate values, and solve contention in distributed systems, improving the latency and throughput.
• SmartNICs can execute serverless workloads (lambda functions), which reduces the load on the servers. They can also be used as an additional execution engine in a heterogeneous data and compute cluster.

XI. CHALLENGES AND FUTURE TRENDS

In this section, several research and operational challenges that correspond to SmartNICs are outlined. The challenges are
extracted after comprehensively reviewing and diving into each work in the described literature. Further, the section discusses and pinpoints several initiatives for future work that could be worthy of being pursued. The challenges and the future trends are illustrated in Fig. 46.

Fig. 46. Challenges and future trends. The references represent examples of existing works that tackle the corresponding future trends.

A. Architectural Diversity and Vendor Specificity

SmartNICs can have different architectural models, each requiring unique programming approaches. Even within the same architecture, SmartNICs from different vendors may necessitate proprietary SDKs and distinct programming methods, which present several challenges:
• Vendor Lock-In: Developers may become dependent on a specific vendor's SDK, making it challenging to migrate to alternative SmartNICs or adopt new technologies. For instance, consider the scenario where a developer has written a packet processing logic using DOCA for BlueField SmartNICs. If they were to transfer this logic to Xilinx FPGA-based SmartNICs, they would need to rewrite the entire logic from scratch.
• Reduced Collaboration: Proprietary SDKs hinder collaboration and knowledge sharing among developers, as expertise gained in one ecosystem may not be easily transferable to another.
• Increased Development Time and Costs: When developers need to tailor their code for each SmartNIC's proprietary SDK, it significantly increases development time and costs. Instead of focusing on advancing the functionality and performance of their applications, developers must spend valuable resources adapting their code to work with different SmartNIC architectures and vendor-specific APIs. This diversion of resources can slow down the pace of innovation within organizations and the industry as a whole.

Current and Future Initiatives: To address these challenges and foster innovation in the SmartNIC space, there is a growing need for standardized programming interfaces and open-source development frameworks. Standardization efforts could promote interoperability among SmartNICs from different vendors and enable developers to write code that is portable across various architectures. Additionally, open-source initiatives can encourage collaboration, drive community-driven innovation, and provide developers with more flexibility and control over their software stack. Several initiatives (e.g., OPI, IPDK, SONiC-DASH) aim to establish standard APIs for SmartNIC programming and administration, reducing vendor dependency. However, vendor-specific functions remain a challenge for generalization.

B. Non-optimized P4 Codes

Developers have been using low-level optimization to enhance the performance of packet processing in SmartNICs. Recently, vendors are embracing P4 as a uniform programming model for SmartNICs [55], [81], [94]. While P4 allows ease of programming and offers a high-level standardized model, it does not guarantee the optimal performance on SmartNICs. This is because the P4 compilers were optimized for switch ASICs, which have a different execution model than SmartNICs. With a switch ASIC, if the program compiles, the packet processing executes at line rate. SmartNICs, on the other hand, follow the run-to-completion model, where packets are assigned to a particular processing engine during their lifetime. With multicore SmartNICs, the packets may experience variable latencies depending on the complexity of the program and its execution paths.

Current and Future Initiatives: A noteworthy work by Xing et al. [58] presented an automated performance optimization framework (Pipeleon) for P4 programmable SmartNICs, see Fig. 47. The framework uses profile-guided optimizations to adapt the P4 program based on the runtime profiles (e.g., traffic patterns and table entries). The input to this framework is a P4 program, which is then partitioned into smaller code snippets called pipelets. The framework leverages the reconfigurability of the SmartNICs (e.g., those that follow the disaggregated dRMT architecture [218], [219]) to realize a more efficient implementation. The framework was tested with BlueField-2 and Agilio CX SmartNICs, and the results show that the optimizations significantly improve the SmartNIC performance in various use cases by up to 5x. Due to such results, it would be beneficial to improve the existing P4 compilers to be tailored to SmartNICs and to consider runtime profiles.

Fig. 47. Pipeleon workflow. Reproduced from [58].

C. Complex Functions Offloading

Effectively utilizing SmartNICs for running offloaded functions presents several challenges. First, SmartNICs have limited computational and memory resources, which restricts the number of functions that can be accommodated on them.
Second, although it is technically possible to host switching exclusively on the SmartNIC, doing so incurs considerable latency costs for packets moving between the functions deployed on the SmartNIC and those on the host. This is due to the overhead caused by the multiple traversals across the host PCI bus. Third, distributing the functions between the host and the SmartNIC introduces management challenges.

Current and Future Initiatives: Le et al. [210] presented UNO, a system that splits switching between the host software and the SmartNIC. It uses a linear programming formulation to determine the optimal placement for functions. UNO uses the traffic pattern and the load of the function as input. The experiments show that the savings in processors is up to eight host cores. UNO also reduces power by 2x. Another work by Wang et al. [211] optimizes the placement of functions according to the processing and the transmission latency. The system analyzes the dependencies and formulates the partition and placement problem using 0-1 linear programming. The system minimizes the inter-device transmissions between the SmartNIC and the CPU, see Fig. 48.

Fig. 48. (a) Non-optimized placement; (b) optimized placement. The inter-device transmissions (red arrows) between SmartNIC and CPU lead to additional element graph latency. Reproduced from [211].

D. Performance Unpredictability

When offloading a function to SmartNICs, developers must refactor the core logic to align with the underlying hardware. Determining the optimal offloading strategy may not be straightforward. Moreover, the performance of ported functions can vary among developers, relying heavily on their understanding of NIC capabilities, see Fig. 49. For instance, using the flow cache can offer orders of magnitude improvement in latency compared to DRAM [213]. This is entirely related to how the programmer implements the code. The performance is also influenced by traffic workloads (e.g., flow volumes, packet sizes, arrival rates). Additional functions on the SmartNIC can pose further challenges, particularly with memory-intensive functions potentially impacting cache utilization for others, and compute-intensive functions potentially causing head-of-line blocking at accelerators [213]. All these factors often lead to unexpected performance fluctuations when migrating a function to a SmartNIC. While benchmarking the program will produce performance results, it requires that the program be already developed on the SmartNIC.

Fig. 49. Normalized latency for different implementations of functions: Network Address Translation (NAT), DPI, Firewall (FW), LPM, Heavy Hitter (HH). Reproduced from [213].

Current and Future Initiatives: Performance prediction can help the developer gain insight prior to porting the code to the hardware. Clara [213] predicts the performance of an unported function on a hypothetical SmartNIC target. Initially, it constructs a model for a given SmartNIC. Then, it creates performance profiles for that SmartNIC by conducting hardware microbenchmarks, which encompass tests on memory latency, accelerator throughput, etc. Clara then creates a code and examines it to identify segments that could be fully offloaded to the SmartNIC. It evaluates the optimal mapping by incorporating constraints derived from the logical NIC model, performance parameters, and code segments. By resolving these constraints, Clara can establish a mapping that optimizes performance after porting. Finally, Clara tests with a PCAP file and assesses how packets would traverse the mapping, thereby providing predictions regarding latency and throughput.

E. Poor Security Isolation

Commodity SmartNICs suffer from poor isolation between offloaded functions and between functions and data center operators [212]. This limitation is a result of the limited access controls on the NIC memory and the absence of virtualization for hardware accelerators. These shortcomings compromise the robustness and security of individual functions, especially in a multi-tenant environment. Additionally, any buggy or compromised code within the NIC poses a risk to all other functions running on it. Concrete attacks on popular SmartNICs, including packet corruption, DPI rules stealing, and IO bus denial of service, are presented in [212].

Current and Future Initiatives: Zhou et al. [212] proposed S-NIC, a hardware design that enforces disaggregation between resources. S-NIC isolates functions at both the ISA level and the microarchitectural level. This ensures integrity and confidentiality, as well as mitigating against side-channel attacks. The design is cost-effective and requires minimal changes to the hardware (e.g., die area). However, it still incurs modest degradation in the performance. Future work could explore alternative architectures that have less impact on performance, or other software-based techniques to isolate the resources.

F. Slow Path Bottleneck

Over recent years, there has been a continuous improvement in the performance of packet-processing data planes, leading to their predominant implementation in hardware such as SmartNICs and programmable switches. Yet, there has been a lack of focus on the slow path, the interface between the data
plane and the control plane, which is traditionally considered non-performance critical. The slow path is responsible for handling a subset of traffic that requires special processing (complex control flow, compute, memory resources). These tasks cannot be executed on the data plane, see Fig. 50. The slow path is executed on the CPU cores, whether on the host or the SmartNIC.

Fig. 50. (a) SDN in theory; (b) SDN in reality. Reproduced from [209].

Lately, the slow path is becoming a major bottleneck, driven by the surge in physical network bandwidth and the increasing complexity of network topologies. There is a growth in slow-path traffic in tandem with user traffic.

Current and Future Initiatives: There is a need to re-evaluate the current approach to balancing workload distribution between the data plane and the slow path. Zulfiqar et al. [209] articulated the limitations of the current slow path and argued that the solution is to have a domain-specific accelerator for the slow path. A challenge with creating such an accelerator is to design a generic architecture with common primitives that support most of the slow path use cases. Ideally, the accelerator would have predictable response times, fast table updates, and support for large memory pools. Further, the paper advocates extending the match-action model found in most packet processing devices to match-compute for the slow path.

G. ML Offload Complexity

Offloading the training or the inference in ML from the CPU/GPU to SmartNICs comes with a set of challenges that limit the scalability and innovation of the deployed models.
• Accuracy vs. compatibility tradeoff: Some hardware architectures do not support floating-point numbers and complex operations, which are required by advanced ML models, such as neural networks. Workarounds that are proposed to overcome these limitations come at the expense of sacrificing the accuracy of the ML model.
• Restriction on the adopted ML algorithm: Despite the continuous exploration of deploying ML models, such as neural networks and decision trees, in SmartNICs, a multitude of algorithms, such as Principal Component Analysis (PCA) and Genetic Algorithms, are yet to be explored. Additionally, models that are currently deployed are static, and any update to the model requires temporarily halting the programmable network device until the new model is compiled and pushed.
• Flexibility of aggregate functions: In the context of training ML models, the traditional aggregate functions are 'min', 'max', 'count', 'sum', and 'avg'. However, over time, several approaches started adopting and providing user-defined aggregate functions. Implementing such functions over some hardware architectures used in SmartNICs is not straightforward.

Current and Future Initiatives: Migration of functionality is one technique that can overcome the restrictions of updating the data plane on the fly. For instance, before the programmable network processor is updated, its functionalities are migrated to another device so that network communication is not interrupted. To deal with the lack of support of floating-point, approaches such as [217] translate floating-point numbers to integers using quantization (i.e., a fixed-point representation of decimal numbers). Such a technique can also be used in complex neural network models that need to be simplified to fit in the data plane. To reduce communication overhead, Ma et al. [186] compress the parameters (i.e., gradients) before sharing them in the network. Such approaches can enhance network performance, especially when numerous networking devices are cooperating.
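The fixed-point idea is illustrated by the short sketch below: a floating-point weight is scaled to an integer that an integer-only pipeline can store and operate on, and scaled back when needed. The 8-bit fractional scale is an illustrative assumption; the achievable accuracy depends on the number of fractional bits chosen.

SCALE = 1 << 8          # fixed-point scale: 8 fractional bits

def quantize(x: float) -> int:
    """Map a float to a fixed-point integer usable by integer-only pipelines."""
    return int(round(x * SCALE))

def dequantize(q: int) -> float:
    return q / SCALE

w = 0.37
q = quantize(w)
print(q, dequantize(q), abs(w - dequantize(q)) <= 1 / (2 * SCALE))
# -> 95 0.37109375 True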
H. Lack of Training Resources

There is an evident lack of detailed documentation and training resources that adequately cover SmartNIC programming and configuration. While some vendors may provide reference applications, basic documentation, and training courses (e.g., [220]), they often fall short of providing the in-depth explanations and hands-on experience that developers need. This makes it difficult for newcomers to understand the intricacies of SmartNIC development and configuration.

Current and Future Initiatives: To address this issue, it is essential for vendors to invest in creating comprehensive training materials, including detailed documentation, tutorials, and hands-on labs. These resources should cover various aspects of SmartNIC programming and configuration, from basic concepts to advanced techniques. Additionally, vendors could offer interactive online courses or workshops led by experienced instructors to provide personalized guidance and support for learners. Some YouTube channels are posting the latest advances and updates on SmartNICs (e.g., STH [214], SNIA [215], OPI [216]). However, they are still not comprehensive enough to allow a beginner to start experimenting with SmartNICs.

XII. CONCLUSION

The evolution of computing has encountered significant challenges with the end of Moore's Law and Dennard Scaling. The emergence of SmartNICs, which combine various domain-specific processors, represents a pivotal shift towards offloading infrastructure tasks and improving network efficiency. This paper has filled a critical void in the literature by providing a comprehensive survey of SmartNICs, encompassing their evolution, architectures, development environments, and applications. The paper has delineated the wide array of functions offloaded to SmartNICs, spanning network, security, storage, and compute tasks. The paper has also discussed the challenges associated with SmartNIC development and deployment, and pinpointed key research initiatives and trends that could be explored in the future. Evidence suggests that SmartNICs are poised to become integral components of every network infrastructure. Smaller networks, which often lack deep technical expertise, can leverage SmartNICs for offloading routine infrastructure tasks. On the other hand, larger and research-oriented networks, with experienced developers, will leverage SmartNICs for offloading complex tasks that are not well-suited for general-purpose CPUs.

ACKNOWLEDGEMENT

This work is supported by the National Science Foundation (NSF), Office of Advanced Cyberinfrastructure (OAC), under grant numbers 2118311, 2403360, and 2346726.
32

Abbreviation Term
TABLE XII
A BBREVIATIONS USED IN THIS ARTICLE . SDK Software Development Kit
SDN Software Defined Network
Abbreviation Term SoC System on a Chip
SPAN Switched Port Analyzer
ACL Access Control List
SPDK Storage Performance Development Kit
AES Advanced Encryption Standard
SSD Solid State Drives
ALU Arithmetic Logic Unit
SVM Support Vector Machine
ANOVA Analysis of Variance
TCP Transmission Control Protocol
API Application Programming Interface
TLS Transport Layer Security
ASIC Application Specific Integrated Circuit
TM Traffic Manager
BCC BPF Compiler Collection
TRNG True Random Number Generator
BPF Berkeley Packet Filter
TSO TCP Segmentation Offload
CLI Command Line Interface
uBPF Userspace BPF
CMS Count-min Sketch
UE User Equipment
CPU Central Processing Unit
UPF User Plane Function
DNN Deep Neural Network
URL Uniform Resource Locator
DIP Dynamic IP
VIP Virtual IP
DOCA Data Center-on-a-Chip Architecture
VM Virtual Machine
DPDK Data Plane Development Kit
VPP Vector Packet Processor
DPI Deep Packet Inspection
VTEP VXLAN Tunnel End Point
DPU Data Processing Unit
VXLAN Virtual Extensible LAN
DRAM Dynamic Random Access Memory
XDP eXpress Data Path
eBPF Extended Berkeley Packet Filter
xPU Auxiliary Processing Unit
ESnet Energy Sciences Network
FPGA Field Programmable Gate Array
GPU Graphics Processing Units
GRE Generic Routing Encapsulation R EFERENCES
GUI Graphical User Interface
HDL Hardware Description Language [1] G. Moore, “Cramming more components onto integrated circuits,”
HPC High Performance Computing Proceedings of the IEEE, 1998.
IDE Integrated Development Environment [2] G. Moore, “Progress in digital integrated electronics,” in Electron
IDS Intrusion Detection System Devices Meeting, 1975.
IP Internet Protocol [3] R. Dennard, F. Gaensslen, H. Yu, V. Rideout, E. Bassous, and
IPDK Infrastructure Programmer Development Kit A. LeBlanc, “Design of ion-implanted MOSFET’s with very small
IPU Infrastructure Processing Unit physical dimensions,” IEEE Journal of solid-state circuits, 1974.
IPS Intrusion Prevention System [4] J. Hennessy and D. Patterson, Computer architecture: a quantitative
IPSec Internet Protocol Security approach. Elsevier, 2011.
IT Information Technology [5] G. Amdahl, “Validity of the single processor approach to achieving
JBOF Just a Bunch of Flash large scale computing capabilities,” in Proceedings of the April 18-20,
KPI Key Performance Indicators 1967, spring joint computer conference, 1967.
kTLS Kernel TLS [6] J. Faircloth, “Enterprise applications administration: The definitive
LAN Local Area Network guide to implementation and operations,” Morgan Kaufmann, 2013.
LUT Lookup Table [7] S. Ibanez, M. Shahbaz, and N. McKeown, “The case for a network
LPM Longest Prefix Matching fast path to the CPU,” in Proceedings of the 18th ACM Workshop on
LSB Least Significant Bit Hot Topics in Networks, 2019.
MBR Maximum Bit Rate [8] M. Metz, “SmartNICs and infrastructure acceleration report 2022,”
ML Machine Learning AvidThink, 2022.
NAS Network Attached Storage
[9] A. Ageev, M. Foroushani, and A. Kaufmann, “Exploring domain-
NAT Network Address Translation
specific architectures for network protocol processing,”
NFV Network Function Virtualization
[10] E. Tell, “A domain specific DSP processor,” Institutionen för sys-
NGFW Next-Generation Firewall
temteknik, 2001.
NIC Network Interface Card
[11] D. Caetano-Anolles, “Hardware - optimizations - SSD - CPU - GPU
NLP Natural Language Processing
- FPGA - TPU,” gatk, 2022.
NVMe Non-Volatile Memory Express
[12] G. Elinoff, “Data centers are overloaded. the inventor of FPGAs is
NVMe-oF Non-Volatile Memory Express over Fabric
swooping in with a “comprehensive” SmartNIC,” March 2020.
OFS Open FPGA Stack
OPAE Open Programmable Acceleration Engine [13] Google, “Encryption in transit.” [Online]. Available: https://tinyurl.co
OPI Open Programmable Infrastructure m/436vh9jh.
OS Operating System [14] J. Morra, “Is this the future of the SmartNIC?.” [Online]. Available:
OvS Open vSwitch https://tinyurl.com/ydru5bcp.
P4 Programming Protocol-independent Packet Processor [15] Microsoft, “Azure SmartNIC.” [Online]. Available: https://tinyurl.co
PCIe Peripheral Component Interconnect Express m/4sj7m7mp.
PISA Protocol Independent Switch Architecture [16] S. Schweitzer, “Architectures, boards, chips and software,” SmartNIC
PMD Poll Mode Driver Summit, 2023.
PNA Portable NIC Architecture [17] AMD, “AMD collaborates with the energy sciences network on launch
PSA Portable Switch Architecture of its next-generation, high-performance network to enhance data-
QoS Quality of Service intensive science,” 2022. [Online]. Available: https://tinyurl.com/
RAM Random Access Memory ycyb382t.
RAN Radio Access Network [18] VMware, “DPU-based acceleration for NSX.” [Online]. Available: ht
RDMA Remote Direct Memory Access tps://tinyurl.com/238v6j5h.
RPC Remote Procedure Call [19] Palo Alto Networks, “Intelligent traffic offload uses smartnic/dpu for
RSS Receive Side Scaling hyperscale security,” 2022. [Online]. Available: https://tinyurl.com/d3
RTL Register Transfer Level 22nda7.
SAN Storage Area Network [20] Juniper Networks, “SmartNICs accelerate the new network edge,”
2021. [Online]. Available: https://tinyurl.com/2uh6uh7t.
[21] S. Vural, “SmartNICs in telco: benefits and use cases,” 2021. [Online].
Available: https://tinyurl.com/8amw8s74.
33

[22] D. Tootaghaj, A. Mercian, V. Adarsh, M. Sharifian, and P. Sharma, [47] R. Parizotto, B. Coelho, D. Nunes, I. Haque, and A. Schaeffer-
“SmartNICs at edge for transient compute elasticity,” in Proceedings Filho, “Offloading machine learning to programmable data planes: A
of the 3rd International Workshop on Distributed Machine Learning, systematic survey,” ACM Computing Surveys, 2023.
2022. [48] W. Quan, Z. Xu, M. Liu, N. Cheng, G. Liu, D. Gao, H. Zhang, X. Shen,
[23] C. Zheng, X. Hong, D. Ding, S. Vargaftik, Y. Ben-Itzhak, and N. Zil- and W. Zhuang, “AI-driven packet forwarding with programmable data
berman, “In-network machine learning using programmable network plane: A survey,” IEEE Communications Surveys & Tutorials, 2022.
devices: A survey,” IEEE Communications Surveys & Tutorials, 2023. [49] J. Gomez, E. Kfoury, J. Crichigno, and G. Srivastava, “A survey on TCP
[24] I. Baldin, A. Nikolich, J. Griffioen, I. Monga, K.-C. Wang, T. Lehman, enhancements using P4-programmable devices,” Computer Networks,
and P. Ruth, “FABRIC: A national-scale programmable experimental 2022.
network infrastructure,” IEEE Internet Computing, 2019. [50] S. Han, S. Jang, H. Choi, H. Lee, and S. Pack, “Virtualization in
[25] GEANT, “GEANT testbed.” [Online]. Available: https://geant.org/. programmable data plane: A survey and open challenges,” IEEE Open
[26] GEANT, “High-performance flow monitoring using programmable Journal of the Communications Society, 2020.
network interface cards,” 2023. [51] J. Brito, J. Moreno, L. Contreras, M. Alvarez-Campana, and M. Blanco,
[27] E. da Cunha, M. Martinello, C. Dominicini, M. Schwarz, M. Ribeiro, “Programmable data plane applications in 5G and beyond architectures:
E. Borges, I. Brito, J. Bezerra, and M. Barcellos, “FABRIC testbed A systematic review,” Sensors, 2023.
from the eyes of a network researcher,” in Anais do II Workshop de [52] A. Mazloum, E. Kfoury, J. Gomez, and J. Crichigno, “A survey
Testbeds, 2023. on rerouting techniques with P4 programmable data plane switches,”
[28] D. Cerović, V. del Piccolo, A. Amamou, K. Haddadou, and G. Pujolle, Computer Networks, 2023.
“Fast packet processing: A survey,” IEEE Communications Surveys & [53] M. Chiesa, A. Kamisiński, J. Rak, G. Rétvári, and S. Schmid, “A survey
Tutorials, 2018. of fast recovery mechanisms in the data plane,” Authorea Preprints,
[29] E. Freitas, A. de Oliveira, P. do Carmo, D. Sadok, and J. Kelner, “A 2023.
survey on accelerating technologies for fast network packet processing [54] NVIDIA, “NVIDIA Mellanox BlueField-2 data processing unit
in Linux environments,” Computer Communications, 2022. (DPU).” [Online]. Available: https://tinyurl.com/yrky7ee5.
[30] L. Linguaglossa, S. Lange, S. Pontarelli, G. Rétvári, D. Rossi, T. Zin- [55] AMD, “Pensando DSC2-200 distributed services card.” [Online]
ner, R. Bifulco, M. Jarschel, and G. Bianchi, “Survey of performance Available: https://tinyurl.com/yr6eeez6.
acceleration techniques for network function virtualization,” Proceed- [56] AMD, “Xilinx Alveo SN1000 SmartNIC.” [Online]. Available: https:
ings of the IEEE, 2019. //tinyurl.com/pxacmnd9.
[31] X. Fei, F. Liu, Q. Zhang, H. Jin, and H. Hu, “Paving the way for [57] N. McKeown, “Why does the internet need a programmable forwarding
NFV acceleration: A taxonomy, survey and future directions,” ACM plane.” [Online]. Available: https://tinyurl.com/ffajhk9y.
Computing Surveys (CSUR), 2020. [58] J. Xing, Y. Qiu, K.-F. Hsu, S. Sui, K. Manaa, O. Shabtai, Y. Piasetzky,
[32] P. Shantharama, A. Thyagaturu, and M. Reisslein, “Hardware- M. Kadosh, and A. Krishnamurthy, “Unleashing SmartNIC packet
accelerated platforms and infrastructures for network functions: A processing performance in P4,” in Proceedings of the ACM SIGCOMM
Elie Kfoury received the Ph.D. degree in Informatics from the University of South Carolina (USC) in 2023. He is currently an assistant professor in the Integrated Information Technology department at USC. As a member of the Cyberinfrastructure Laboratory, he developed training materials using virtual labs on high-speed networks, TCP congestion control, programmable switches, SDN, and cybersecurity. He is the co-author of the book “High-Speed Networks: A Tutorial,” which is being used nationally for deploying, troubleshooting, and tuning Science DMZ networks. His research interests include P4 programmable data planes, computer networks, cybersecurity, and blockchain. He previously worked as a research and teaching assistant in the computer science department at the American University of Science and Technology in Beirut.

Samia Choueiri is a Ph.D. student in the College of Engineering and Computing at the University of South Carolina (USC). Her research interests include SmartNICs, P4 switches, cybersecurity, and robotics. She received her Master’s degree in Computer and Communications Engineering, with an emphasis in Mechatronics Engineering, from the American University of Science and Technology in Beirut, where she was also a teaching assistant and lab instructor.

Ali Mazloum is a Ph.D. student in the College of Engineering and Computing at the University of South Carolina (USC) in the United States of America. Prior to joining USC, he received his bachelor’s degree in computer science from the American University of Beirut (AUB). His research focuses on P4 programmable data planes, SmartNICs, cybersecurity, network measurements, and traffic engineering.

Ali AlSabeh is currently a Ph.D. student in the College of Engineering and Computing at the University of South Carolina, USA. He is a member of the CyberInfrastructure Lab (CI Lab), where he developed training materials for virtual labs on network protocols (BGP, OSPF) and their applications (BGP attributes, BGP hijacking, IP spoofing, etc.), as well as SDN (OpenFlow, interconnecting SDN with legacy networks, etc.). He previously earned his M.S. degree in Computer Science from the American University of Beirut, where he also worked as a graduate research assistant and teaching assistant. His research focuses on malware analysis, network security, and P4 programmable switches.
Jose Gomez is a Ph.D. student in the College of Engineering and Computing at the University of South Carolina. Jose’s research focuses on P4 programmable data planes, TCP congestion control, passive measurements, and buffer sizing. Currently, Jose is working at the Cyberinfrastructure lab developing a system based on P4 switches to enable programmability in non-programmable networks.
Jorge Crichigno is a Professor in the College of Engineering and Computing at the University of South Carolina (USC). He has over 15 years of experience in the academic and industry sectors. Prior to joining USC, Dr. Crichigno was an Associate Professor and Chair of the Department of Engineering at Northern New Mexico College. Dr. Crichigno’s research focuses on the practical implementation of high-speed networks and network security, including the design and implementation of high-speed switched networks, TCP optimization, experimental evaluation of congestion control algorithms tailored for friction-free environments, and scalable flow-based intrusion detection systems. His work has been funded by Google, NSF, and the Department of Energy. He received his Ph.D. in Computer Engineering from the University of New Mexico in 2009, and his bachelor’s degree in Electrical Engineering from the Catholic University of Paraguay in 2004.