Fig. 2. Network port speeds (solid line, right y-axis) versus processor speed (dotted line, left y-axis) over the years. Port speeds are increasing while processor speeds are plateauing. Reproduced from [12].
TABLE I
COMPARISON WITH RELATED SURVEYS.
Topics compared per survey: evolution and definition; architectures and models; development environments; applications and offloaded workloads taxonomy; comparisons with regular NICs; challenges and discussions; research trends and directions. The related surveys [28]-[34] cover each of these topics either partially or not at all, whereas this survey covers all of them. (Legend: covered in this survey; partially covered; not covered.)
The paper discusses the challenges associated with SmartNICs and concludes by discussing future perspectives and open research issues.

B. Paper Organization

The road map of this survey is depicted in Fig. 3. Section II compares existing surveys on SmartNICs and related technologies and demonstrates the novelty of this work. Section III presents an overview of the evolution of NICs, from traditional basic NICs to SmartNICs. It describes the components of SmartNICs and their benefits compared to legacy NICs. Section IV describes the SmartNIC hardware architectures. Section V describes the tools, frameworks, and development environments for SmartNICs, both open-source and vendor-specific. Section VI provides a taxonomy of the applications and infrastructure workloads that are offloaded to SmartNICs. The subsequent sections (Sections VII-X) describe the security, network, storage, and compute functions. Section XI lists challenges associated with SmartNICs. It then discusses current initiatives that overcome the challenges and provides a reflection on open research issues. Section XII concludes the paper. The abbreviations used in this article are summarized in Table XII, at the end of the article.

II. RELATED SURVEYS

Despite the widespread interest from both industry and academia in SmartNICs, there is a noticeable absence of a comprehensive survey that adequately explores their potential and ongoing research endeavors. The existing surveys that are closest to this paper can be divided into 1) packet processing acceleration; and 2) programmable data planes.

A. Surveys on Packet Processing Acceleration

The existing surveys in this category discuss the advantages of accelerating packet processing, particularly with software technologies. However, while SmartNICs are occasionally mentioned in these surveys, they fail to delve into crucial aspects such as their potential, architectures, applications, etc.

Cerović et al. [28] discuss various software-based and hardware-based packet accelerators. The survey focuses on server-class networking. It starts by explaining the problems associated with using the standard Linux kernel for packet processing in high-speed networks and then delves into exploring the different classes of packet accelerators. For the software-based packet accelerators, the survey mainly describes and analyzes the Data Plane Development Kit (DPDK) [35], PF_RING [36], NetSlices [37], and Netmap [38]. For the hardware-based packet accelerators, it focuses mainly on leveraging GPUs and Field Programmable Gate Arrays (FPGAs) for optimized and efficient packet processing. The survey does not cover the latest generation of SmartNICs that include CPU cores and domain-specific accelerators. Also, the survey does not cover the applications or the infrastructure workloads that can be offloaded to SmartNICs.

Freitas et al. [29] describe multiple packet processing acceleration techniques. The survey focuses on packet processing in Linux environments. It categorizes packet processing acceleration into hardware-, software-, and virtualization-based. For each category, the survey offers background information and discusses a simple use case. The survey also provides discussions on host resource usage efficiency, high packet rates, system security, and flexibility/expandability. The survey briefly mentions programmable NICs (another term used for SmartNICs) and their role in accelerating packet processing. It does not cover their development environments, hardware architectures, or the applications/workloads that can be offloaded.

Linguaglossa et al. [30] focus on software and hardware technologies that accelerate Network Function Virtualization (NFV). The survey categorizes software acceleration technologies into pure software acceleration and hardware-supported functions in software. It also provides a brief overview of the software acceleration ecosystem, which includes DPDK, XDP, Netmap, and PF_RING. For the hardware technologies, it discusses the offloading functions of traditional NICs (e.g., CRC calculation, checksum computation, and the TCP Offload Engine (TOE)) and a subset of the hardware architectures of SmartNICs. Then, it provides a brief overview of the programming abstractions in SmartNICs. The survey has the following limitations: 1) it does not cover all the hardware architectures; 2) it does not cover the development tools and environments; and 3) it does not cover the applications and infrastructure workloads that can be offloaded to SmartNICs.

Fei et al. [31] also focus on NFV acceleration. The survey classifies NFV acceleration into three high-level categories:
Fig. 4. Main functional blocks of (a) traditional NICs, (b) offload NICs, and (c) SmartNICs.
computation, communication, and traffic steering. Under the computation category, the survey discusses some hardware offloading architectures, which include SmartNICs. The remainder of the survey focuses on software acceleration and how to tune the system to achieve better performance. The survey has the following limitations: 1) it does not cover the hardware architectures used by the latest generation of SmartNICs; 2) it does not cover the development tools and environments; and 3) it does not cover the applications and workloads that can be offloaded to the SmartNIC.

Shantharama et al. [32] provide a comprehensive survey on softwarized NFs. The survey classifies the CPU, the memory, and the interconnects as the three main enabling technologies for NFV. With low-level details, the survey explains how each class operates and how it can be optimized to provide better virtualization support. It also discusses the use of dedicated hardware accelerators (FPGAs, ASICs, etc.) to improve the performance of softwarized NFs. The survey briefly describes some of the applications offloaded to SmartNICs without providing a sufficient overview of the technology, the different available development environments, or the latest enhancements in the field of SmartNICs.

Vieira et al. [33] only focus on the extended Berkeley Packet Filter (eBPF) and the eXpress Data Path (XDP) software acceleration techniques. The survey illustrates the process of enhancing packet processing speed by running eBPF-based applications in the XDP layer of the Linux kernel network stack. It presents a tutorial that includes the compilation and verification processes, the program structure, the required tools, and walk-through example programs. Although the authors mention SmartNICs as a target platform for eBPF applications, the survey does not cover the available architectures of SmartNICs, the development environments, or the applications that can be offloaded to SmartNICs.

Rosa et al. [34] describe multiple software and hardware techniques to enhance packet processing speed in the cloud. While discussing software-based techniques, the survey focuses on zero-copy data transfers, minimal context switching, and asynchronous processing as the core techniques for network acceleration. After that, it shows how DPDK, XDP, and eBPF are used in the cloud to enable Network Acceleration as a Service (NAaaS). While discussing hardware-based techniques, the survey only focuses on RDMA. The authors only describe SmartNICs as an enabling technology for RDMA and virtualization, without describing their different architectures, development environments, or their different capabilities for enhancing network acceleration.

B. Surveys on Programmable Data Planes

Numerous surveys have covered the general aspects of programmable data planes in the past few years [39]-[43]. Some surveys focused on specific areas such as network security [44]-[46], ML training and inference [47], [48], TCP enhancements [49], virtualization and cloud computing [50], 5G and telecommunications [51], and rerouting and fast recovery [52], [53]. All these surveys have discussed some applications developed on SmartNICs. However, their focus is on programmable switches (e.g., Intel's Tofino). Recent advances in SmartNICs are not covered in these surveys.

C. Novelty

Table I summarizes the topics and the features described in the related surveys. It also highlights how this paper differs from the existing surveys. To the best of the authors' knowledge, this work is the first to exhaustively explore the whole SmartNIC ecosystem. Unlike previous surveys, this survey provides in-depth discussions on the evolution and definition of SmartNICs, the common architectures used by various SmartNIC models in the market, and the development environments (both open source and proprietary). It then provides a detailed taxonomy covering the applications that are offloaded to SmartNICs, while highlighting the performance gains compared to regular NICs. The survey also presents the challenges associated with programming and deploying SmartNICs, as well as the current and future research trends.

III. EVOLUTION OF NETWORK INTERFACE CARDS (NICS)

There are three main generations of NICs: traditional NICs, offload NICs, and SmartNICs. Fig. 4 shows a simplified diagram of the three NICs.
a combination of processors used for custom packet processing and other domain-specific functions. Some SmartNICs (e.g.,

Fig. 6. Programmable pipeline: programmable parser, match-action stages (state, memory, ALUs), and programmable deparser.

1 In this context, infrastructure functions refer to tasks that facilitate data movement to the host and do not involve application data.
2 IPU is the terminology used by Intel.
3 xPU is used by the Storage Networking Industry Association (SNIA) community.
Fig. 7. SmartNIC architectures taxonomy: System on Chip (SoC) SmartNICs and discrete SmartNICs.
Table II contrasts the main characteristics of traditional, offload, and SmartNICs. In the latter, the infrastructure functions are separated from the user applications; this isolation improves security by protecting the user applications on the host. The separation is possible due to the presence of CPU cores and domain-specific accelerators on the SmartNIC. Moreover, the data plane (i.e., packet processing) of the SmartNIC is customizable and is defined by the developer's code; this provides flexibility in defining and processing new protocols as well as innovating with new applications. The technology maturity and the standardization of architectures for SmartNICs can still be considered low in contrast to traditional and offload NICs.

IV. SMARTNIC ARCHITECTURES

The definition of a SmartNIC in Section III-C targeted SoC SmartNICs. SoC SmartNICs comprise computing units, which include a general-purpose ARM/MIPS multicore processor, as well as a multi-level onboard memory hierarchy. There is another category of SmartNICs, referred to as discrete SmartNICs. A discrete SmartNIC does not incorporate CPU cores and thus cannot run autonomously without a host platform. Regardless of whether the SmartNIC is SoC or discrete, its packet processing logic may be implemented with an ASIC or an FPGA. The SmartNICs available in the market may employ either of these hardware architectures or, in some cases, a combination of both. The SmartNIC architecture taxonomy is shown in Fig. 7. Table III summarizes the differences between the various SmartNIC architectures, as described next.

TABLE III
COMPARISON BETWEEN VARIOUS SMARTNIC ARCHITECTURES.
Architecture | Cost   | Programming complexity | Flexibility | Speed
ASIC         | Low    | Low                    | Low         | High
FPGA         | High   | High                   | Medium      | High
ASIC + FPGA  | High   | Medium                 | Medium      | High
ASIC + CPU   | Medium | Low                    | High        | Medium
FPGA + CPU   | High   | High                   | High        | Medium

Hardware implementations come with tradeoffs in terms of cost, programming simplicity, and adaptability. While an ASIC offers cost-effectiveness and optimal price performance, its flexibility is limited. ASIC-based SmartNICs feature a programmable data path that is relatively straightforward to configure, yet this programmability is constrained by predefined functions within the ASIC, leading to potential limitations in supporting certain workloads. In contrast, an FPGA-based SmartNIC is exceptionally programmable. Given sufficient time, effort, and expertise, it can efficiently accommodate nearly any functionality within the confines of the available gates on the FPGA. However, FPGAs are known for being challenging to program and can be costly.

Integrating both an ASIC and an FPGA within the SmartNIC presents a balanced solution. Common functions are efficiently executed on the ASIC, leveraging its ease of programmability compared to the FPGA. Functions that cannot be programmed on the ASIC are implemented on the FPGA, providing flexibility, albeit with increased programming complexity. This design provides high packet processing speed but is costly due to the use of FPGA technology.

Table IV shows some popular commercial discrete SmartNICs from various vendors and their specifications.

B. SoC SmartNICs

Integrating general-purpose CPU cores into the SmartNIC can offer several advantages: 1) it significantly reduces programming complexity, as these cores can be programmed using languages such as C; 2) the flexibility of the system is greatly enhanced, allowing for the implementation of a wide range of programs, including those with complex features like loops and multiplications (this versatility is particularly challenging to achieve on an ASIC or FPGA); 3) the management of the SmartNIC becomes easier and independent of the host; and 4) it is possible to run an OS and make the SmartNIC autonomous. While the CPU cores allow additional features on the NIC, functions executed on the CPU cores might not achieve line-rate performance and could incur increased latency. Table V shows some popular commercial SoC SmartNICs from various vendors and their specifications.
TABLE IV
COMMERCIAL DISCRETE SMARTNICS FROM VARIOUS VENDORS.

TABLE V
COMMERCIAL SOC SMARTNICS FROM VARIOUS VENDORS.
C. On-path and Off-path SmartNICs

Another way to categorize the architectures of SmartNICs is based on how their NIC cores interact with network traffic. There are two categories: on-path and off-path [90].

1) On-path SmartNICs: With on-path SmartNICs (Fig. 8 (a)), the NIC cores actively manipulate each incoming and outgoing packet along the communication path. These SmartNICs provide low-level programmable interfaces, allowing for direct manipulation of raw packets. In this design, the offloaded code is situated close to the network packets, increasing efficiency. However, the drawback is that the offloaded code competes for NIC cores with requests sent to the host. If too much computation is offloaded onto the SmartNIC, it can result in a significant degradation of regular networking requests sent to the host. Additionally, programming on-path NICs can be challenging due to the utilization of low-level APIs.

2) Off-path SmartNICs: Off-path SmartNICs (Fig. 8 (b)) take a different approach by incorporating additional compute cores and memory in a separate SoC located next to the NIC cores. The offloaded code is strategically placed off the critical path of the network processing pipeline. The SoC is treated as a second full-fledged host with an exclusive network interface, connected to the NIC cores and the host through an embedded switch (sometimes referred to as an eSwitch). Based on the forwarding rules installed on the embedded switch, the traffic will be delivered to the host or to the SmartNIC cores. In contrast to on-path SmartNICs, the offloaded code in off-path SmartNICs does not impact the host's network performance. This clear separation enables the SoC to run a complete kernel (e.g., Linux) with a comprehensive network stack (e.g., RDMA), simplifying system development and allowing for the offloading of complex tasks.

Table VI summarizes the differences between the on-path and off-path SmartNICs.

TABLE VI
CHARACTERISTICS OF ON-PATH AND OFF-PATH SMARTNICS.
Characteristic           | On-path | Off-path
NIC switch               | ×       | ✓
Operating system         | ×       | ✓
Full network stack       | ×       | ✓
Programming complexity   | High    | Low
Host performance impact  | High    | Low
Complex code offloading  | Low     | High

V. SMARTNICS DEVELOPMENT TOOLS AND FRAMEWORKS

This section provides an overview of the development tools and frameworks employed for programming SmartNICs. The taxonomy, illustrated in Fig. 9, categorizes them based on the specific component within the SmartNIC being programmed.

A. Programmable Pipeline

The packet processing logic is commonly built using ASICs or FPGAs. The development of offloaded applications depends on the hardware architecture and the vendor's Software Development Kits (SDKs).
Fig. 9. Taxonomy of SmartNIC development tools and frameworks, categorized by component-specific technologies and software development environments.
Fig. 11. Portable NIC Architecture (PNA).

Fig. 12. P4 compilation process.

one or more host operating systems, drivers, and/or message descriptor formats in host memory.

PNA has features that are not traditionally supported by other similar P4 architectures, including:
1) Table entry modification: other P4 architectures only allow modifying table entries from the control plane. PNA allows modifying the entries of a table directly from the data plane.
2) Table accessibility: traditional P4 architectures allow only one operation on a table per stage. With PNA, tables can be accessed by multiple stages, even in different pipelines.
3) Non-packet processing: PNA facilitates message processing, enabling operations on larger blocks of data to be transferred to and from host memory.
4) Accelerator invocation: PNA is the only P4 architecture that supports invoking accelerators (e.g., a crypto accelerator).

Table VII compares and contrasts PNA and the Portable Switch Architecture (PSA), an architecture mainly used by switches.

TABLE VII
COMPARISON BETWEEN P4 ARCHITECTURES: PSA AND PNA.

3) P4 Compiler: After writing a P4 program, the programmer invokes the compiler to generate a binary that will be deployed on the target device (e.g., the programmable pipeline of the SmartNIC). Consider Fig. 12. The P4 compiler (p4c) has a frontend and a backend. The frontend is universal across all targets and handles the parsing, syntactic analysis, and target-independent semantic analysis of the program. The frontend generates an Intermediate Representation (IR), which is then compiled by the backend compiler for a specific target. The backend is provided by the manufacturer of the device.

4) FPGA Programming: FPGAs consist of an array of configurable logic blocks and programmable interconnects, allowing users to define the functionality of the chip based on their application requirements. FPGA-based SmartNICs follow the same programming workflows as other FPGAs provided by the vendors. This means that the development tools, methodologies, and languages used for programming traditional FPGAs can be applied to SmartNICs as well. FPGA vendors provide software tools that facilitate the programming process. These tools include Integrated Development Environments (IDEs) and compilers that translate Hardware Description Languages (HDLs) such as VHDL and Verilog into configuration files for the FPGA.

5) P4-FPGA: Programming FPGAs with languages such as VHDL or Verilog can be challenging and time-consuming, especially for newcomers. To address this issue, frameworks have been developed to translate P4 code into FPGA bitstreams. P4, being a high-level and user-friendly language ideal for programming datapaths, offers a faster and more efficient alternative for FPGA programming. This approach streamlines the programming process, making it particularly accessible for users without extensive FPGA programming expertise, ultimately enhancing both accessibility and efficiency. However, there are challenges in designing a compiler that translates P4 code to VHDL or Verilog. First, FPGAs are typically programmed using low-level libraries that are not portable across devices. Second, generating an efficient implementation from a source P4 program is difficult since programs vary widely and architectures make different tradeoffs.

The community has been actively working on developing P4-FPGA compilers. Vendors (e.g., Xilinx [95], Intel [96]) are providing their own workflows to generate bitstreams from P4 on their targets. P4-FPGA tools can significantly reduce the engineering effort required to develop packet-processing systems based on these devices while maintaining high performance per Lookup Table (LUT) or Random Access Memory (RAM).
Fig. 13. Software packet processing. (a) Standard packet processing (interrupt-based); (b) kernel-bypass packet processing (polling mode).

Fig. 14. XDP packet processing. (a) Native XDP, slower; (b) offloaded XDP, faster.
application. These steps induce overheads that dramatically degrade the throughput. Today's NICs have already reached more than 200Gbps [18]. As NICs become faster, the available time for processing individual packets becomes increasingly limited. For instance, at 200Gbps, the time between consecutive 1500-byte packets is as low as 60 nanoseconds (ns). The standard network stack is inadequate to keep up with such high traffic rates.
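For concreteness, the 60 ns figure follows directly from the line rate: t = (1500 bytes × 8 bits/byte) / (200 × 10^9 bits/s) = 60 ns per packet. Even counting the Ethernet preamble and inter-frame gap (roughly 1538 bytes on the wire), the per-packet budget grows only to about 61.5 ns, which is on the order of a single DRAM access.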
1) Data Plane Development Kit (DPDK): DPDK comprises a collection of libraries and drivers designed to enhance packet processing efficiency by bypassing the kernel space and handling packets within user space (see Fig. 13 (b)). With DPDK, the ports of the NIC are disassociated from the kernel driver and associated with a DPDK-compatible driver. In contrast to the conventional method of packet processing within the kernel stack using interrupts, the DPDK driver operates as a Poll Mode Driver (PMD). It consistently polls for incoming packets. The utilization of a PMD, combined with the kernel bypass, yields superior packet processing performance. DPDK's APIs can be used in C programs. DPDK started as a project by Intel and then became open source. Its community has been growing, and DPDK now supports all major CPU and NIC architectures from various vendors. A list of supported NICs can be found at [97].
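To make the poll-mode model concrete, the following is a minimal sketch of a DPDK receive loop. It assumes port 0 has already been configured (memory pools, rte_eth_dev_configure(), queue setup), omits error handling, and uses an arbitrary burst size; it is illustrative only, not a complete application.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Minimal DPDK poll-mode receive loop (illustrative sketch). */
int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;                         /* EAL initialization failed */

    uint16_t port_id = 0;                  /* assume port 0 was configured elsewhere */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC queue instead of waiting for interrupts. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... process packet bufs[i] ... */
            rte_pktmbuf_free(bufs[i]);     /* return the buffer to the pool */
        }
    }
    return 0;
}
```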
2) eXpress Data Path (XDP) and extended Berkeley Packet Filter (eBPF): When utilizing DPDK, the kernel is bypassed to achieve enhanced performance. However, this comes at the cost of losing access to the networking functionalities provided by the kernel. User space applications are then required to re-implement these functionalities. XDP presents a solution to this issue. XDP operates as an eBPF program within the kernel's network code. It introduces an early hook in the RX (receive) path of the kernel, specifically within the NIC driver after interrupt processing. This early hook allows the execution of a user-supplied eBPF program, enabling decisions to be made before the Linux networking stack code is executed. Decisions include dropping packets, passing packets to the normal network stack, and redirecting packets to other ports on the NIC. XDP reduces the kernel overhead and avoids process context switches, network layer processing, interrupts, etc.

XDP programs have callbacks that will be invoked when a packet is received on the NIC. There are three models for deploying an XDP program:
• Generic XDP: XDP programs are incorporated into the kernel within the regular network path. While this method does not deliver optimal performance advantages, it serves as a convenient means to experiment with XDP programs or deploy them on standard hardware that lacks dedicated support for XDP.
• Native XDP: The NIC driver loads the XDP program during its initial receive path, see Fig. 14 (a). Support from the NIC hardware is required for this mode.
• Offloaded XDP: The XDP program is loaded directly by the NIC hardware, bypassing the CPU as a whole, see Fig. 14 (b). This requires support from the NIC hardware.
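As an illustrative, deliberately minimal example of the hook described above, the following eBPF/XDP program drops all UDP packets and passes everything else. It can be compiled with clang to BPF bytecode and attached in generic, native, or offloaded mode depending on driver and hardware support; the filtering policy is arbitrary and chosen only to keep the sketch short.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Minimal XDP program: drop UDP, pass everything else. */
SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                   /* malformed frame, let the stack decide */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```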
3) P4 Backends: Creating P4 programs is generally considered more straightforward than writing DPDK or BPF/XDP code. Consequently, there have been efforts to translate P4 into these codes. The P4 compiler (p4c) is equipped with backends specifically designed for generating DPDK, BPF/XDP, and Userspace BPF (uBPF) code. Table VIII compares the P4 backends.

TABLE VIII
COMPARISON BETWEEN THE P4 BACKENDS.
Feature            | P4-DPDK        | P4-eBPF/XDP          | P4-uBPF
Userspace          | ✓              | ×                    | ✓
NIC support        | ✓              | ✓                    | ✓
P4 architectures   | PNA, PSA       | [ebpf,xdp] model.p4  | ubpf model.p4
Compilation        | P4→spec→C→so   | P4→C→eBPF bytecode   | P4→C→uBPF bytecode
Supported features | High           | Low                  | Medium

a) P4-DPDK: The p4c-dpdk backend translates P416 programs into the DPDK Application Programming Interface (API), allowing the configuration of the DPDK software switch (SWX) pipeline [98]. The P4 programs can be written
Machines (VMs). OvS has two major components, the control

Fig. 17. OvS hardware offload. (a) OvS-DPDK with hardware offload using rte_flow; (b) OvS kernel with hardware offload using TC Flower.

4 Preliminary experiments show that more features are implemented for PNA than for PSA for the P4-DPDK target.
Fig. 18. DOCA URL filter reference application.

specific traffic, altering the packets, querying related counters, etc. Matching within this context can be based on various criteria such as packet data (including protocol headers and payload) and properties like the associated physical port or virtual device function ID. Operations supported by the rte_flow API include dropping traffic, diverting traffic to specific queues, directing traffic to virtual or physical device functions or ports, performing tunnel offloads, and applying marks, among others.
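As a minimal sketch of how such a rule might be installed through rte_flow, the function below offloads a "drop all UDP traffic" rule on one port. The pattern, attributes, and port number are simplified and hypothetical; which patterns and actions can actually be offloaded to hardware depends on the NIC and its PMD.

```c
#include <rte_flow.h>
#include <rte_ethdev.h>

/* Illustrative sketch: offload "drop all UDP traffic" on a port via rte_flow. */
static struct rte_flow *install_drop_udp(uint16_t port_id, struct rte_flow_error *err)
{
    struct rte_flow_attr attr = { .ingress = 1 };

    /* Match: Ethernet / IPv4 / UDP (no specific addresses or ports). */
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
        { .type = RTE_FLOW_ITEM_TYPE_UDP },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    /* Action: drop the matching packets in hardware. */
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_DROP },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    if (rte_flow_validate(port_id, &attr, pattern, actions, err) != 0)
        return NULL;                       /* rule cannot be offloaded on this device */
    return rte_flow_create(port_id, &attr, pattern, actions, err);
}
```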
b) OvS-Kernel and TC Flower: The OvS kernel datapath can use TC Flower [100] to configure rules on the hardware switch integrated into the SmartNIC, see Fig. 17 (b). Within the Linux kernel, the TC flower classifier, which is a component of the TC subsystem, offers a means to specify packet matches using a defined flow key. This flow key encompasses fields extracted from packet headers and, if desired, tunnel metadata. TC actions enable the execution of diverse operations on packets, such as drop, modify, output, and various other functionalities.

D. Vendor-specific SDKs - ASIC

The following SDKs are proprietary and target ASIC-based SmartNICs.

1) NVIDIA's DOCA: The Data Center-on-a-Chip Architecture (DOCA) is a software development framework developed by NVIDIA for the BlueField SmartNICs [54]. This framework encompasses various components, including libraries, service agents, and reference applications. Applications developed using DOCA are written in the C programming language and incorporate support for DPDK. This integration ensures that developers have access to all DPDK APIs for efficient packet processing. Additionally, DOCA comes equipped with its own set of libraries designed to streamline interactions with the components on the SmartNIC. For instance, to implement IPsec or perform encryption and decryption, DOCA offers dedicated APIs that developers can easily invoke, simplifying the integration of these functionalities into their applications.

One noteworthy library within DOCA is DOCA Flow [101]. This library allows programmers to customize packet processing by defining matching criteria and actions. These match-action units are defined in pipes, which can be chained. Given DOCA's reliance on DPDK, it leverages rte_flow to transmit rules to the embedded switch (NIC switch). NVIDIA employs its proprietary ASAP2 technology [102] for implementing the embedded switch and for efficient traffic offloading to the hardware.

Consider Fig. 18, which shows an example of a DOCA application for Uniform Resource Locator (URL) filtering. The developer must create OvS bridges and connect scalable functions (SFs)5 to them. Note that the OvS bridges are hardware-offloaded. In this specific example, one bridge is used to connect the physical port to the application (OvS-BR2). Another bridge is used to connect the application to the host (OvS-BR1). The incoming packets on the physical port will be forwarded to the application, which runs on the CPU cores. URL filtering involves parsing the application layer because the URL to be visited is located in the HTTP header. The SmartNIC will invoke the regular expression (RegEx) hardware accelerator to scan for the URL, which is significantly faster than scanning using the CPU. A third bridge can be created to enable the user to manage the application (e.g., specifying the URLs to be blocked). BlueField provides gRPC interfaces for the runtime configuration.

It is possible to develop DOCA applications without the hardware; however, testing the compiled software must be done on top of a BlueField [103].

2) OCTEON SDK: The OCTEON SDK is a comprehensive suite that integrates a development environment and optimized software modules for building applications on OCTEON family processors. The suite consists of a base SDK, a virtualization layer, and a collection of SDK extension packages designed for specific application functions. The Base SDK relies on a standard Linux environment and user-space DPDK (see Fig. 19). It facilitates the seamless compilation of DPDK, Linux, or control plane applications on top of it with minimal adjustments. Programmers write C code and invoke libraries for accelerating functions, including compression/decompression, regex matching, encryption/decryption, and more.

In addition to the Base SDK, the suite includes SDK extensions that help users enable complex applications.

5 An SF is a lightweight function that has dedicated queues for sending and receiving packets; it is analogous to the virtual function (VF) used in SR-IOV.
TABLE IX
COMPARISON BETWEEN THE VENDOR-SPECIFIC SDKS FOR ASIC SMARTNICS.
Characteristic            | NVIDIA DOCA     | Octeon SDK        | Pensando SSDK     | Intel/Barefoot SDE
Supported SmartNICs       | BlueField 2/3/X | Marvell LiquidIO  | Pensando DSC-200  | Intel IPU E2000
P4 support                | ×*              | ×*                | ✓                 | ✓
Development w/o hardware  | ✓               | ✓                 | ✓                 | ×
Simulator/emulator        | ×               | ✓                 | ✓                 | ×
Special licensing         | ×               | ×                 | ✓                 | ✓
*While P4 is not the main language used for programming the packet processing engine, it can be used for programming the CPU cores (e.g., with P4-DPDK).

Fig. 20. AMD Pensando SSDK: development toolchain (build environment with P4 compiler, DSC simulator and test environment) and platform libraries with sample code, plus Linux kernel and DPDK drivers.
These extensions consist of pre-optimized, application-specific modules bundled into packages that run on the Base SDK. Notable extensions include OvS-DPDK, the Vector Packet Processor (VPP), secure key storage, a trusted execution environment, etc. Furthermore, the OCTEON SDK provides a cycle-accurate simulator. This simulator enables developers to test the behavior of their programs with precision and accuracy in software.

3) AMD Pensando SSDK: The AMD Pensando SDK facilitates software development for the AMD Pensando SmartNIC. This comprehensive SDK includes a P416 compiler, debugging tools, a DPDK driver, example code, and thorough documentation (see Fig. 20). Specifically, P416 can be used to write code for execution in the programmable pipeline. C and C++ are used to write code for the CPU core complex. Additionally, the SDK allows invoking the SmartNIC's built-in domain-specific accelerators.

Similar to DOCA, developers have the flexibility to compile applications without the SmartNIC hardware. However, unlike DOCA, the Pensando SDK provides a simulator, allowing developers to test their ideas before uploading the image to the hardware. This validation capability becomes particularly advantageous when integrating the SDK and simulator into CI/CD-based development workflows. The simulator boasts machine-register accuracy, ensuring that any code developed for it can be cross-compiled to run seamlessly on the real hardware. The simulator serves as a valuable tool for validation, speeding up development, and simplifying debugging processes within a virtualized environment.

The reference applications included with the AMD Pensando SDK include a basic skeleton hello world, Software Defined Network (SDN) policy offload with Longest Prefix Matching (LPM), Access Control Lists (ACLs), flow aging, an IPsec gateway, and other classic host offloads such as TCP Segmentation Offload (TSO), checksum calculation, and Receive Side Scaling (RSS).

4) Barefoot SDE / Intel P4 Studio: The compiler used for programming the pipeline on the Intel IPU has similarities with that used for programming the Tofino switches [81]. The compiler was originally developed by Barefoot Networks, which was acquired by Intel in 2021. This compiler was formerly known as the Barefoot SDE, and it has now been rebranded as Intel P4 Studio. It is well-established and has undergone extensive revisions and optimizations. Additionally, the compiler is equipped with a Graphical User Interface (GUI) tool (P4 Insight [104]) that offers comprehensive insights into resource utilization. This includes details such as the location of specific match-action tables, the utilization of hash bits, and the usage of SRAM/TCAM. The public documentation does not provide clear specifics on how the compiler differs between Tofino switches and SmartNICs.

5) SDKs for ASIC SmartNICs Comparison: Table IX compares the four SDKs. The characteristics compared include the supported SmartNIC models, P4 language support, development feasibility with or without dedicated hardware, availability of simulators or emulators for testing, and the necessity for special licensing. The AMD Pensando and Intel SmartNICs are P4 programmable, and thus their SDKs provide a P4 compiler. The NVIDIA BlueField and the Octeon SDK only support P4 for their CPU cores (e.g., through P4-DPDK). Furthermore, all SDKs except the Intel/Barefoot SDE offer development without dedicated hardware, and the Pensando SSDK and Octeon SDK provide simulators or emulators for testing purposes. The Pensando SSDK and the Intel SDE require the customer to sign a Non-disclosure Agreement (NDA) to get the license for the SDKs.

E. Vendor-specific SDKs - FPGAs

The following SDKs are proprietary and target FPGA-based SmartNICs.

1) Vitis Networking P4: Vitis Networking P4 [105], developed by AMD Xilinx, is the development environment for their FPGA SmartNICs. This high-level design environment greatly simplifies the creation of packet-processing data planes through P4 programs (see Section V-A5). The tool's primary function is to translate the P4 design intent into a comprehensive AMD FPGA design solution. The compiler maps the control flow onto a custom data plane architecture composed of various engines. This process involves selecting suitable engine types and tailoring each one according to the specified P4 processing requirements. The architecture definition file for Vitis Networking P4 is named xsa.p4. This architecture follows the open-source P4 PNA architecture (see Section V-A2a).

Fig. 21 illustrates the AMD Vivado hardware tool flows designed for AMD Vitis Networking P4 implementations. There is a flow for the software, which is used for testing the behavior of the P4 program. The other flow is for the hardware.
Fig. 21. AMD Xilinx's Vitis Networking P4 software and hardware flows.

Fig. 22. Intel P4 Suite for FPGA workflow.
Fig. 24. Taxonomy of offloaded applications and infrastructure workloads, organized into security (Section VII), networking (Section VIII), storage (Section IX), and compute (Section X), e.g., firewalls and packet filters, intrusion detection/prevention, Deep Packet Inspection (DPI), data-in-transit and data-at-rest encryption; switching/routing (Open vSwitch), tunneling and overlays (VxLAN, GRE, Geneve), observability and telemetry, load balancing (L4-L7, Receive Side Scaling), 5G User Plane Function; NVMe-oF initiator/target offload, data replication, compression (Deflate, zlib, SZ3); machine learning training and inference, key-value stores, transaction processing, and serverless computing.
Specific FPGA Interface Managers (FIMs) can be developed, making use of the Open Programmable Acceleration Engine (OPAE) SDK. OPAE, which is a subset of OFS, is a software layer comprising various API libraries. These libraries are used when programming host applications that interact with the FPGA accelerator.

F. Vendor-agnostic

1) Open Programmable Infrastructure (OPI): The OPI is a community-driven initiative focused on creating common APIs to configure and manage different SmartNIC targets [108]. Instead of relying on vendor-specific SDKs, developers can use OPI's standardized APIs to activate services, effectively abstracting the complexities associated with vendor-specific SDKs. Consider Fig. 23. The developer uses gRPC and REST APIs to initiate calls to the API gateway. The gateway acts as a load balancer between four shim APIs: network, storage, security, and AI/ML. These shim APIs then translate the calls to the hardware accelerators through the vendor-specific SDKs. With such a design, portability can be ensured across various targets. Note that developers can still execute functions provided by the vendor if they are not available through the OPI APIs.

2) Infrastructure Programmer Development Kit (IPDK): The IPDK is an open-source, vendor-agnostic framework comprising drivers and APIs tailored for infrastructure offload and management tasks. It is versatile and capable of running on a range of hardware platforms including SmartNICs, CPUs, or switches. Operating within the Linux environment, IPDK leverages established tools like the Storage Performance Development Kit (SPDK), DPDK, and P4 to facilitate network and storage virtualization, workload provisioning, root-of-trust establishment, and various offload capabilities inherent to the platform. IPDK is a sub-project of OPI.

IPDK already supports multiple targets including P4-DPDK, OCTEON SmartNICs, the Intel IPU, Intel FPGAs, and Tofino-based programmable switches [109].

IPDK has two main interfaces: 1) the Infrastructure Application Interface; and 2) the Target Abstraction Interface. The Infrastructure Application Interface serves as the northbound interface of the SmartNIC, encapsulating the diverse range of Remote Procedure Calls (RPCs) supported within IPDK. The Target Abstraction Interface represents an abstraction provided by an infrastructure device (e.g., a SmartNIC) that runs infrastructure applications for connected compute instances. These instances could include attached hosts and/or VMs, which may or may not be containerized.

3) SONiC-DASH: SONiC, an open-source operating system for network devices, has experienced significant growth [110], [111]. The SONiC community has introduced a new open-source project called DASH (Disaggregated APIs for SONiC Hosts), aiming at being an abstraction framework for SmartNICs and other network devices. It consists of a set of APIs and object models which cover network services for the cloud. The initial objective of DASH is to enhance the performance and connection scale of SDN operations, aiming to achieve a speed increase of 10 to 100 times compared to software-based solutions in today's clouds and enterprises. DASH's ecosystem includes a community of cloud providers,
hardware suppliers, and system solution providers.

VI. OFFLOADED APPLICATIONS TAXONOMY

This section describes the systematic methodology that was adopted to generate the proposed taxonomy. The results of this literature survey represent findings derived by thoroughly exploring the SmartNIC-related research works published in the last five years.

Fig. 24 shows the proposed taxonomy. The taxonomy was meticulously designed to cover the most significant works related to SmartNICs. The aim is to categorize the surveyed works based on various high-level disciplines. The taxonomy provides a clear separation of categories so that a reader interested in a specific discipline can read only the works pertaining to that discipline.

SmartNICs accelerate various infrastructure applications, categorized primarily into security, networking, and storage functions. They also accelerate various computing workloads including AI/ML inference and training, caching (key-value stores), transaction processing, serverless functions, and others. Each high-level category in the taxonomy is further divided into sub-categories. For instance, the various transaction processing works belong to the sub-category "Transaction processing" under the high-level category "Compute". Additionally, the survey offers performance comparisons between applications running on the host and those offloaded to the SmartNICs.

The subsequent subsections delve into the ongoing developments within each of the aforementioned categories, offering insights into the lessons learned from these advancements.

VII. SECURITY

The landscape of data center traffic has undergone a significant transformation with the rise of cloud-hosted applications and microservices [112]. Traditionally, traffic patterns were
be processed by a security function on the general-purpose CPUs. This increases latency and decreases the throughput.
• Scalability: The CPU cores often struggle to inspect traffic at high rates, particularly in the absence of software accelerators (e.g., DPDK). This can lead to high packet drop rates.
• Isolation: All traffic, including malicious traffic, is sent to the host. This lack of isolation can pose security risks.
• CPU usage: Security functions consume a substantial portion of the CPU processing power, particularly during periods of high traffic volume. This can result in performance bottlenecks and service degradation for end-user applications.

To mitigate these issues, SmartNICs have been used to offload the security functions from general-purpose CPUs, see Fig. 25 (d). Specifically, SmartNICs have been used to offload firewall functionalities, IDS/IPS, DPI, and data-in-motion and data-at-rest encryption.

A. Firewall

A firewall monitors incoming and outgoing network traffic and allows or blocks packets based on a set of preconfigured rules. Firewalls typically operate up to layer 4 to perform basic ACL operations. This means that the traffic can be matched against network layer information (e.g., source/destination IP addresses) and transport layer information (e.g., source/destination port numbers).

Software-based firewalls are widely used, especially in cloud environments [116]. They are typically implemented in conjunction with a virtual switch (e.g., OvS). With software-based firewalls, traffic is inspected using the CPU cores of the host where the firewall is running. This degrades the performance and consumes the compute capacity of the CPU. Recall that SmartNICs are equipped with a programmable pipeline or an embedded switch, where match-action rules

6 Sometimes referred to as traffic tromboning.

Offload type - Throughput (Gbps)
Traditional (Software) 12 18
SmartNIC (Hardware) 0.3

TABLE X
COMPARISON BETWEEN VARIOUS WORKS OFFLOADING IDS/IPS FUNCTIONS TO SMARTNICS.
as application recognition (sometimes referred to as App-ID), signature matching for malware, etc.

SmartNICs are now integrating hardware-based RegEx engines. These engines perform pattern matching directly within the hardware, offering improved efficiency compared to traditional software-based approaches. Applications leveraging RegEx matching load a pre-compiled rule set into the engines at runtime. This hardware-driven approach helps alleviate the performance concerns associated with DPI in IDS/IPS, making network security more robust and responsive.

DPI has also been implemented in hardware from scratch (e.g., using an FPGA). Ceska et al. [132] proposed an FPGA architecture for regular expression matching that can process network traffic beyond 100Gbps. The system compiles approximate Non-deterministic Finite Automata (NFAs) into a multi-stage architecture. The system uses reduction techniques to optimize the NFAs so that they can fit in the FPGA resources. The system was implemented on a Xilinx FPGA. Other works [133]-[136] have also explored optimizing NFAs for FPGAs.

3) Offloading custom IPS/IDS functions: Zhao et al. [122] proposed Pigasus, an IDS that uses an FPGA to perform the majority of the IDS functions, and a CPU to perform the secondary functions. Pigasus achieves 100Gbps with 100K+ concurrent connections and 10K+ matching rules, on a single server. It requires on average five CPU cores and a single FPGA-based SmartNIC. The system was tested using an Intel Stratix SmartNIC. Another FPGA-based solution, proposed by Chen et al. [123], is Fidas, which offloads rule pattern matching and traffic flow rate classification. Fidas achieves lower latency and higher throughput than Pigasus. It was implemented on a Xilinx FPGA. Zhao et al. [124] implemented an FPGA design to analyze Internet of Things (IoT) traffic and summarize it in real time. The CPU then uses a flow entropy algorithm to detect the threats.

Panda et al. [125] proposed SmartWatch, a system that combines P4 switches and SmartNICs to perform IDS/IPS functions. The P4 switches perform coarse-grained traffic analysis while the SmartNIC conducts the finer-grained analysis. The SmartNIC used is the Netronome Agilio. Wu et al. [126] implemented an anomaly detection-based IDS on the CPU cores of the BlueField SmartNIC. The system uses the Analysis of Variance (ANOVA) statistical method for detecting anomalies. Tasdemir et al. [127] implemented an SQL attack detection system on the BlueField SmartNIC. The system uses NLP and ML classifiers to analyze and classify SQL queries. Miano et al. [128] implemented a DDoS mitigation system by combining hardware-based packet filtering on the SmartNIC and software-based packet filtering using XDP/eBPF.

Table X summarizes and compares the aforementioned works that offload custom IDS/IPS functions to the SmartNICs.

C. IPSec offload

The Internet Protocol Security (IPSec) suite implements a set of protocols to establish secure connections between end devices. This is achieved through the encryption and authentication of IP packets. IPsec comprises key modules including 1) key exchange, which facilitates the establishment of encryption and decryption keys through a mutual exchange between connected devices; 2) authentication, which verifies the trustworthiness of each packet's source; and 3) encryption and decryption, which encrypts/decrypts the payload within packets and potentially, based on the transport mode, the packet's IP header.

IPSec has a data plane (DP) and a control plane (CP). The CP is responsible for the key exchange and session

Fig. 28. IPSec on the host (a) and IPSec offloaded to the SmartNIC (b).
virtual switches can handle additional tasks such as Network Address Translation (NAT), tunneling, and QoS functionalities such as rate limiting, policing, and scheduling.

Fig. 31. Throughput in million packets per second (Mpps) of software vs SmartNIC tunneling. (a) 114B packets and 64 connections; (b) 114B packets and 250,000 connections. Reproduced from [151].

implemented in the embedded NIC switch [154] or the programmable pipeline [55]. Tunnel definition, which is part of the control plane, is implemented in software, on the CPU cores of the SmartNIC. This design not only improves throughput and reduces latency for the encapsulation/decapsulation operations, but also frees up CPU cycles on the host for other tasks. Fig. 31 compares the tunneling performance between software and a BlueField SmartNIC, reproduced from [151]. With 114-byte packets and 64 connections, the SmartNIC tunneling throughput is ∼60 times higher than the software-based one. With 114-byte packets and 250,000 connections, the SmartNIC tunneling throughput is ∼20 times higher than the software-based one.
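To ground the encapsulation operation being offloaded, the sketch below shows, purely for illustration, the 8-byte VXLAN header defined in RFC 7348 and how it conceptually prefixes the original Ethernet frame; the outer Ethernet/IP/UDP headers are omitted, and on a SmartNIC this rewrite is performed by the NIC switch or programmable pipeline rather than by host code like this.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

/* VXLAN header (RFC 7348): flags word + 24-bit VXLAN Network Identifier (VNI). */
struct vxlan_hdr {
    uint32_t flags_reserved;   /* "I" flag (0x08 in the first byte) marks a valid VNI */
    uint32_t vni_reserved;     /* VNI in the upper 24 bits, low byte reserved */
};

/* Conceptual encapsulation: place a VXLAN header in front of the inner frame. */
static size_t vxlan_encap(uint8_t *out, const uint8_t *inner_frame,
                          size_t inner_len, uint32_t vni)
{
    struct vxlan_hdr vxh = {
        .flags_reserved = htonl(0x08000000u),
        .vni_reserved   = htonl(vni << 8),
    };
    memcpy(out, &vxh, sizeof(vxh));
    memcpy(out + sizeof(vxh), inner_frame, inner_len);
    return sizeof(vxh) + inner_len;
}
```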
Fig. 32. (a) Port mirroring; (b) TAP; (c) NetFlow export.

C. Observability - Monitoring and Telemetry

Observability is the ability to collect and extract telemetry information. During a network outage, effective observability facilitates diagnosing and troubleshooting problems. It can also help in detecting malicious events and identifying network performance bottlenecks.

Traditional packet observability solutions are typically implemented in hardware, situated outside the server. Examples include configuring port mirroring (e.g., Switched Port Analyzer (SPAN)) on switches/routers, see Fig. 32 (a), deploying network TAPs for replicating packets, see Fig. 32 (b), and exporting flow-based statistics using NetFlow [155] or IPFIX [156] to a remote collector, see Fig. 32 (c).

1) Offloading Packet Observability to SmartNICs: The traditional approaches to packet observability are all supported by SmartNICs. SmartNICs can mirror packets and send them to remote collectors. They can also export telemetry using flow-based telemetry solutions like NetFlow or IPFIX, or using packet-level telemetry streaming such as In-band Network Telemetry (INT) [157] and In-situ OAM [158]. SmartNICs can also monitor and aggregate telemetry locally, which avoids excessive traffic exports. Furthermore, since they incorporate programmable pipelines, they can be used to implement more complex packet telemetry than the traditional approaches. For example, it is possible to implement streaming algorithms such as the Count-min Sketch (CMS) [159] to estimate the number of packets per flow in a scalable way, or a Bloom filter [160] to test the occurrence of an element in a set. Such telemetry information can be very useful for a variety of applications (e.g., security [161], performance analysis [162], etc.).
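As an illustration of this kind of streaming primitive, the following is a minimal Count-min Sketch in C. The array dimensions and hash function are arbitrary choices for the sketch, not taken from any particular SmartNIC implementation; in a programmable pipeline the same update logic would typically be expressed with register arrays playing the role of the counter rows.

```c
#include <stdint.h>
#include <stddef.h>

#define CMS_ROWS 4
#define CMS_COLS 2048            /* power of two so indexing can use a mask */

static uint32_t cms[CMS_ROWS][CMS_COLS];

/* Simple per-row hash (FNV-1a variant, seeded per row); illustrative only. */
static uint32_t cms_hash(const void *key, size_t len, uint32_t seed)
{
    const uint8_t *p = key;
    uint32_t h = 2166136261u ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Increment the estimated count of a flow key (e.g., the 5-tuple). */
static void cms_update(const void *key, size_t len)
{
    for (uint32_t r = 0; r < CMS_ROWS; r++)
        cms[r][cms_hash(key, len, r) & (CMS_COLS - 1)]++;
}

/* Query: the estimate is the minimum across rows (never an underestimate). */
static uint32_t cms_query(const void *key, size_t len)
{
    uint32_t min = UINT32_MAX;
    for (uint32_t r = 0; r < CMS_ROWS; r++) {
        uint32_t v = cms[r][cms_hash(key, len, r) & (CMS_COLS - 1)];
        if (v < min)
            min = v;
    }
    return min;
}
```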
2) Offloading System Observability to SmartNICs: The SmartNIC also offers supplementary telemetry data related to the system in which it is located [163], such as the host. For example, the SmartNIC can provide telemetry data containing the CPU, memory, and disk usage of the host.

3) VM and Containers Observability with SmartNICs: External approaches to packet observability cannot observe inter-VM/container traffic within the same server. While software-based approaches for monitoring VMs and containers exist, see Fig. 33 (a), they often burden the CPU, especially with high traffic rates [163]. SmartNICs provide hardware visibility on traffic between VMs or containers within the same server (see Fig. 33 (b)), alleviating the CPU burden on the host.

Fig. 33. VM and containers observability with (a) software switches and (b) SmartNICs.

D. Load Balancing

Load balancers play a crucial role in modern cloud environments by distributing network requests across servers in data centers efficiently. Traditionally, load balancers relied on specialized hardware, but now software-based solutions are prevalent among cloud providers. This shift offers flexibility and allows for on-demand provisioning on standard servers, though it comes with higher provisioning and operational expenses. While software-based load balancers offer greater customization and adaptability compared to hardware-based counterparts, they also entail considerable costs for cloud providers due to server purchase expenses and increased energy consumption.

Load balancers are categorized into two main types: Layer 4 (L4) and Layer 7 (L7). L4 load balancers function at the transport layer of the network stack. They associate a Virtual IP address (VIP) with a list of backend servers, each having its own dynamic IP (DIP) address. Routing decisions made by L4 load balancers are based solely on the packet headers of the transport/IP layers, considering factors such as source and destination IP addresses and ports. Thus, L4 load balancers do not inspect the payload content of the packets. On the other hand, L7 load balancers operate at a higher layer, specifically the application layer. These balancers are more intricate, as they analyze content within the packets, particularly focusing on application-layer protocols like HTTP. The L7 load balancer directs incoming requests to appropriate backend servers based on the specific service being accessed. For instance, differentiation may occur based on URLs.

1) Offloading Load Balancing to SmartNICs: Several works have offloaded load balancing to SmartNICs. Cui et al. [164] proposed Laconic, a system that improves the performance of load balancing due to three key points: 1)
Header fields
Control plane
Received packet Data
N3 User plane N6
Network
function (UPF)
Hash 1 User Equipment Radio Access
function Indirection (UE) Network (RAN)
table
2 Fig. 35. 5G network architecture. The packet core is implemented as VNF
LSB
on general-purpose CPUs. The SmartNIC is being used to offload the UPF
Hash value functions.
3
1
on general-purpose CPUs rather than dedicated appliances.
N
General-purpose CPUs are not capable of guaranteeing high
throughput and low latency, which are the requirements and
Fig. 34. Receive Side Scaling (RSS). the Key Performance Indicators (KPI) of 5G networks.
2) Receive Side Scaling (RSS): SmartNICs commonly include an accelerator for RSS, which is a mechanism to distribute incoming network traffic across multiple CPU cores. To achieve this, the SmartNIC calculates a hash value (Toeplitz hash [171]) based on header fields (such as the five-tuple) of the received network packet, see Fig. 34. The hash value's Least Significant Bits (LSBs) are then used as indices into an indirection table, the values of which are used to allocate the incoming data to a specific CPU core. Some SmartNICs also allow steering packets to queues based on programmable filters.

Fig. 34. Receive Side Scaling (RSS).
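The hash-to-core mapping can be illustrated with a short sketch. The following is a schematic Python illustration of the RSS indirection step; a trivial CRC stands in for the actual Toeplitz hash [171], and the table size is an arbitrary choice.

```python
import zlib

NUM_CORES = 8
TABLE_SIZE = 128  # number of indirection-table entries (power of two)

# Indirection table: entry i -> CPU core. Here it is filled round-robin, but a
# driver can rewrite individual entries to rebalance load without rehashing flows.
indirection_table = [i % NUM_CORES for i in range(TABLE_SIZE)]

def rss_core(five_tuple):
    """Pick the CPU core for a packet from its five-tuple."""
    # Stand-in for the Toeplitz hash computed over the packet's header fields.
    h = zlib.crc32(repr(five_tuple).encode())
    index = h & (TABLE_SIZE - 1)        # use the LSBs of the hash as the table index
    return indirection_table[index]

print(rss_core(("10.0.0.1", "10.0.0.2", 6, 40000, 443)))
```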
In 5G networks, the packet core is implemented as Virtual Network Functions (VNFs) running on general-purpose CPUs rather than dedicated appliances. General-purpose CPUs are not capable of guaranteeing high throughput and low latency, which are the requirements and the Key Performance Indicators (KPIs) of 5G networks.

Fig. 35. 5G network architecture. The packet core is implemented as a VNF on general-purpose CPUs. The SmartNIC is used to offload the UPF functions.

1) UPF offload to SmartNIC: The SmartNIC can be used to offload the User Plane Function (UPF) [173]. Specifically, the following functions are offloaded: GTP-U tunneling: the encapsulation and decapsulation of packets run at line rate; policing: the SmartNIC controls the bit rates of the devices so that they do not exceed the Maximum Bit Rate (MBR); statistics: counters and metrics are calculated and used for billing purposes; QoS: the SmartNIC applies Differentiated Services Code Point (DSCP) markings on flows to enable 5G QoS; load balancing: the SmartNIC balances the traffic to the corresponding application; Network Address Translation (NAT): the SmartNIC translates IP addresses on traffic; etc.

Offloading the UPF will not only improve throughput and reduce latency, but it will also boost the number of users per server (7x according to [173]) and lower the Capital Expenditure (CapEx) per user.
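To make the policing function concrete, the following is a minimal, generic token-bucket rate limiter of the kind a UPF policer could apply per device to enforce the MBR. The rate and burst values are arbitrary examples, and real SmartNIC policers implement this logic in hardware rather than in Python.

```python
import time

class TokenBucketPolicer:
    """Simple token-bucket policer: forward a packet only if enough tokens remain."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0       # refill rate in bytes per second
        self.burst = burst_bytes         # bucket depth in bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_len):
        now = time.monotonic()
        # Refill tokens proportionally to the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_len <= self.tokens:
            self.tokens -= packet_len
            return True   # within the MBR: forward
        return False      # exceeds the MBR: drop (or mark)

policer = TokenBucketPolicer(rate_bps=100_000_000, burst_bytes=64_000)  # 100 Mbps MBR
print(policer.allow(1500))
```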
F. Summary and Lessons Learned

SmartNICs significantly improve the performance of network functions and reduce their CPU consumption on the hosts. The key takeaways are:
• The packet switching functions (i.e., matching header fields and taking actions) can be accelerated with SmartNICs. This is because SmartNICs, whether they use a NIC switch or a programmable pipeline, have lookups and ALUs implemented in hardware.
• The performance of tunneling operations (encapsulation/decapsulation) can be significantly improved when offloaded to the SmartNIC. This also frees the CPU cores that were previously used for performing the tunneling operations.
• SmartNICs not only support traditional telemetry solutions (e.g., mirroring, NetFlow/IPFIX, INT), but can also implement custom, more complex telemetry in their programmable pipelines.
IX. STORAGE

Traditionally, storage devices were directly attached to individual computers or servers. This method provided fast access to data but lacked scalability and centralized management. Network Attached Storage (NAS) emerged as a solution to these limitations. It involves connecting storage devices to a network, allowing multiple users and clients to access the storage resources over the network. NAS provided file-level access to data. Storage Area Network (SAN) provides a high-speed network that connects storage devices to servers, providing block-level access to storage resources. SANs offer higher performance and scalability compared to NAS.

Traditional remote storage mechanisms establish a connection between a local host initiator and a remote target. This process heavily burdens the host CPU, leading to a significant decrease in overall performance. SmartNICs can be used to offload the processing from the host CPU.

A. NVMe-oF Initiator

Non-Volatile Memory Express (NVMe) is an interface specification for accessing a computer's non-volatile storage media, usually attached via the PCI Express bus. It is typically used for accessing high-speed storage devices like Solid State Drives (SSDs). NVMe over Fabrics (NVMe-oF) extends NVMe to operate over network fabrics such as Ethernet, Fibre Channel, or InfiniBand. The NVMe initiator initiates and manages communication with NVMe targets. It sends commands to NVMe targets to read, write, or perform other operations. The NVMe target refers to the NVMe storage device itself.
Fig. 36 (a) shows the traditional method of NVMe-oF using the TCP protocol and a regular NIC. The entire NVMe-oF initiator software stack operates on the host. Tasks such as cryptography and CRC computations further strain the host CPU and memory bandwidth.

1) NVMe-oF Initiator Offload: The NVMe-oF initiator functionality can be offloaded to the SmartNIC (Fig. 36 (b)), minimizing the overhead on the host. The SmartNIC exposes a high-performance PCIe interface and NVMe interface to the host. Requests from applications are simply forwarded to a lightweight NVMe driver on the host. The initiator stack on the SmartNIC leverages the hardware accelerators for tasks like inline cryptography and CRC offloading. The TCP stack can either remain on the CPU cores of the SmartNIC or be offloaded to the hardware itself, depending on performance considerations and SmartNIC capabilities. This division of NVMe-oF functions between hardware and software allows the offload to be tuned to each deployment.

Another offload to the SmartNIC is NVMe-oF over RDMA. The NVMe/RDMA data path is implemented in the hardware, with inline cryptography and CRC offloaded. This approach offers a high-performance, low-latency solution.
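To give a sense of why checksum computation is worth offloading, the short sketch below is an illustrative host-side measurement (not vendor code): it times a software CRC32 over a stream of 4 KiB blocks, the kind of per-block integrity work that an offloaded initiator performs inline in hardware.

```python
import time
import zlib

BLOCK_SIZE = 4096                 # typical storage block size
NUM_BLOCKS = 100_000
block = bytes(BLOCK_SIZE)         # dummy data block

start = time.perf_counter()
crc = 0
for _ in range(NUM_BLOCKS):
    # Software CRC32 (a stand-in for the per-block checksum an NVMe-oF stack computes).
    crc = zlib.crc32(block, crc)
elapsed = time.perf_counter() - start

mb = BLOCK_SIZE * NUM_BLOCKS / 1e6
print(f"CRC32 over {mb:.0f} MB took {elapsed:.2f} s "
      f"({mb / elapsed:.0f} MB/s on one CPU core)")
```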
B. NVMe-oF Target

Another offload opportunity is offloading the storage target functions. On a storage target such as a JBOF supporting NVMe-oF, there is a CPU positioned between the network and the NVMe SSDs, see Fig. 37 (a). This CPU runs software responsible for converting NVMe-oF Ethernet or InfiniBand traffic into NVMe PCIe commands. The software comprises various components, including a network adapter stack, an NVMe-oF stack, the operating system block layer, and an NVMe SSD stack. Both the network adapter and the SSDs utilize queues and memory buffers to interface with the different software stacks.

When a request originates from the network, it arrives at the network adapter as an RDMA SEND with the NVMe command encapsulated. The adapter then forwards it to its driver on the target CPU, which further passes it to the NVMe-oF target driver. The NVMe command proceeds through the driver for the SSDs and then to the NVMe SSD controller. Subsequently, the response follows the reverse path through the software layers.

1) NVMe-oF Target Offload: With the offload, the fast path is shifted to the hardware on the SmartNIC. Instead of burdening CPU cycles with millions of Input/Output Operations per Second (IOPS), the adapter now handles the load using specialized function hardware. Software stacks remain in place
for management traffic. The reduction in latency by removing the software from the data path is by a factor of three [174]. Moreover, the CPU usage with offload is negligible.

C. Compression and Decompression

The surge in data volumes has caused performance bottlenecks for storage applications. Data compression is a widely adopted technique that mitigates this bottleneck by reducing the data size. It encodes information using fewer bits than the original representation. Notably, machine learning, databases, and network communication rely on compression techniques, both lossless and lossy, to enhance their performance. Data compression is compute-intensive and time-consuming, especially with large sizes of data to be compressed.

1) Offloading Compression to SmartNICs: SmartNICs include onboard hardware accelerators that enable the offloading of compression and decompression tasks from host CPUs. This offloading alleviates the strain on host resources, resulting in savings and improved performance. Fig. 38 shows the CPU utilization when compression is executed entirely on the host (denoted as CPU) versus when executed on the compression hardware engine of the SmartNIC (denoted as SmartNIC). The experiment shows results for various compression algorithms (e.g., DEFLATE [176], zlib [177], SZ3 [178]) over seven datasets. The datasets are sorted in the figure by their sizes in ascending order; each dataset is a column in the figure. The experiment is reproduced from [175]. When the compression is executed entirely on the host, the CPU usage approaches 100%, especially with large datasets. With a SmartNIC, there is a significant reduction in the CPU utilization.

Fig. 38. CPU utilization during compression with various algorithms (DEFLATE, zlib, SZ3) on seven datasets. Reproduced from [175].

Fig. 39 shows the compression time needed when executed entirely on the host (denoted as CPU) versus when executed on the compression hardware engine of the SmartNIC. The experiment is reproduced from [175]. With a SmartNIC, there is a significant reduction in the compression time, regardless of the size of the dataset.

Fig. 39. Compression time with various algorithms (DEFLATE, zlib, SZ3) on seven datasets. Reproduced from [175].
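As a simple point of reference for the host-side baseline in such experiments, the snippet below compresses a buffer with Python's zlib module (which implements DEFLATE) and reports the time taken. It is illustrative only and is not the benchmark harness of [175]; on a SmartNIC the same job would be handed to the onboard compression engine instead of a CPU core.

```python
import time
import zlib

# Illustrative input: roughly 64 MB of moderately compressible data.
data = b"SmartNIC telemetry record 0123456789\n" * (64 * 1024 * 1024 // 37)

start = time.perf_counter()
compressed = zlib.compress(data, level=6)   # DEFLATE executed on the host CPU
elapsed = time.perf_counter() - start

print(f"in: {len(data)/1e6:.0f} MB, out: {len(compressed)/1e6:.1f} MB, "
      f"ratio: {len(data)/len(compressed):.1f}x, time: {elapsed:.2f} s")
```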
D. Summary and Lessons Learned

Offloading storage functions to the SmartNICs improves the performance.
• Due to the hardware accelerators in the SmartNIC (e.g., compression, crypto), storage operations like compression, deduplication, and crypto will run faster than on the host's CPU.
• The SmartNIC can be deployed on the initiator or the storage target. In both deployments, the CPU usage on the hosting device is negligible, the latency is minimized, and the number of IOPS is improved.

X. COMPUTE

This section examines applications offloaded to the SmartNIC that are not specifically tailored to infrastructure functions. Instead, these applications leverage the SmartNIC for accelerated computing tasks.

A. Machine Learning

State-of-the-art deep ML models have significantly expanded in size, playing a critical role in various domains, including computer vision, Natural Language Processing (NLP), and others [47]. The scale of these models has seen a dramatic increase, with the number of parameters growing from 94 million in 2018 [179] to 174 trillion in 2022 [180]. This exponential growth owes much to advancements in parallel and distributed computing, enabling tasks related to model training to be distributed across multiple computing resources simultaneously. The practice of offloading parts of ML tasks to network resources traces back to the 2000s [181], a trend that continued with the advent of Software Defined Networking (SDN), where ML primarily operates within the control plane [182]. The recent emergence of programmable data planes (i.e., programmable switches, SmartNICs) has further spurred research and practical applications toward offloading ML phases, such as training and inference, to the hardware. Offloading ML tasks can occur on a single network device or across multiple devices, depending on network requirements and the complexity of the offloaded ML task.

1) ML training: The training of large ML models can be accelerated by following a distributed approach. This involves computing gradients on each device based on a subset of the data, which are then aggregated to update model parameters. Additionally, optimization of model parameters can be carried out in the data plane to maximize accuracy.
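The aggregation step in such distributed training is conceptually simple. The sketch below is a generic, framework-agnostic illustration of averaging worker gradients, the kind of reduction a SmartNIC or switch could perform on the wire; it is not code for any specific system cited here.

```python
import numpy as np

def aggregate_gradients(worker_gradients):
    """Average per-layer gradients across workers (the reduction step of data-parallel training)."""
    return [np.mean(np.stack(layer), axis=0) for layer in zip(*worker_gradients)]

# Example: 4 workers, each producing gradients for a two-layer model.
rng = np.random.default_rng(0)
workers = [[rng.standard_normal((128, 64)), rng.standard_normal(64)] for _ in range(4)]
averaged = aggregate_gradients(workers)
print([g.shape for g in averaged])   # [(128, 64), (64,)]
```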
Another aspect of the key-value store that was offloaded to the SmartNIC is the ordering of elements. Ordered key-value stores enable additional applications by allowing an efficient SCAN operation. Liu et al. [199] proposed Honeycomb, an FPGA-based system that provides hardware acceleration for an in-memory ordered key-value store. It focuses on read-dominated workloads. Consider Fig. 42. The B-Tree accelerator implements the GET and SCAN operations. The CPU executes the PUT, UPDATE, and DELETE operations. The B-Tree is stored on the onboard DRAM of the FPGA and on the memory of the host. Storing the B-Tree on the host allows better scalability since its memory is larger than that of the FPGA. The memory subsystem maintains a cache and communicates with the onboard DRAM. It also communicates with the host memory using PCIe. The implementation shows that the system increases the throughput of another ordered key-value store [200] by 1.8x.

Fig. 42. Honeycomb system architecture [199].

Chen et al. [201] designed a heterogeneous key-value store where a primary instance runs on the host and a secondary instance runs on a SmartNIC. The system identifies the popular items and replicates them to the SmartNIC. The popular items are identified with moving-window access counters. The server instance serves the read and write requests of all keys while the SmartNIC instance serves only the read requests of popular items. This system targets read-intensive workloads with skewed access. The system was implemented on a BlueField-2, and the results show that the throughput is improved by 1.86x compared to a standalone RDMA key-value store.
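To make the role of ordering concrete, the following is a minimal software model of an ordered key-value store supporting GET, PUT, and range SCAN; a plain sorted list stands in for Honeycomb's hardware B-Tree, so this is illustrative only.

```python
import bisect

class OrderedKVStore:
    """Tiny ordered key-value store: GET/PUT plus range SCAN over sorted keys."""

    def __init__(self):
        self.keys = []     # kept sorted
        self.values = {}

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def get(self, key):
        return self.values.get(key)

    def scan(self, start_key, end_key):
        # Efficient range query enabled by the key ordering.
        lo = bisect.bisect_left(self.keys, start_key)
        hi = bisect.bisect_right(self.keys, end_key)
        return [(k, self.values[k]) for k in self.keys[lo:hi]]

store = OrderedKVStore()
for k in ["user:005", "user:001", "user:003"]:
    store.put(k, {"name": k})
print(store.scan("user:001", "user:004"))   # ordered range query
```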
C. Transaction Processing

High-performance transaction processing is important to enable various distributed applications. These systems need to manage a large number of requests from the network efficiently. One crucial aspect is determining how to schedule each transaction request to the most suitable CPU core. Consider Fig. 43 (a), which shows the architecture of a transaction processing system without scheduling. A traditional NIC receives requests from the clients and dispatches them to the worker threads. The worker threads then execute the transactions while considering the contention issues that might happen. Contention in this context means that two workers are accessing the same data and at least one of them is issuing a write. In Fig. 43 (a), two transactions (txn0 and txn1) are writing to the same data blocks A and C. In such a scenario, the transactions are typically aborted, causing the clients to resend the transactions, which degrades the performance.

Fig. 43. Transaction processing systems. (a) a system without scheduling; (b) scheduling using a SmartNIC. Reproduced from [202].

Li et al. [202] proposed using a SmartNIC to schedule the transactions to the appropriate worker threads. The SmartNIC maintains the runtime states, giving it the flexibility to make accurate scheduling decisions. The SmartNIC queues the transactions belonging to the same worker thread. This avoids having the clients resend the transactions. The system is implemented on an FPGA-based SmartNIC, which further reduces the scheduling overhead. The system was implemented over the Innova-2 SmartNIC, and the results show that the throughput is boosted by 2.68x and the latency is reduced by 48.8% compared to CPU-based scheduling.
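The scheduling idea can be sketched in a few lines: route transactions that touch the same data to the same worker queue, so conflicting transactions are serialized instead of aborted. The following is a simplified software model of this concept, not the FPGA implementation of [202].

```python
from collections import defaultdict

NUM_WORKERS = 4
worker_queues = defaultdict(list)

def schedule(txn_id, write_set):
    """Map a transaction to a worker based on the data blocks it writes."""
    # All transactions writing a given primary block land on the same worker,
    # so conflicting transactions are queued one after another instead of aborted.
    worker = hash(min(write_set)) % NUM_WORKERS
    worker_queues[worker].append(txn_id)
    return worker

print(schedule("txn0", {"A", "C"}))   # txn0 and txn1 write A and C ...
print(schedule("txn1", {"A", "C"}))   # ... so they are queued on the same worker
print(schedule("txn2", {"B"}))
```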
Schuh et al. [203] implemented Xenic, a SmartNIC-based system that applies an asynchronous, aggregated execution model to maximize network and core efficiency. It uses a data store on both the SmartNIC and the host. This data store provides fast access to host data via indexing. It also maintains state to mitigate concurrency and contention issues. Xenic also aggregates work at all inputs and outputs of the SmartNIC to achieve communication efficiency. The system was implemented on a LiquidIO SmartNIC. The results show that Xenic improves the throughput of prior RDMA-based systems by approximately 2x, reduces the latency by up to 59%, and saves server threads.

D. Serverless Computing

Figure 44 shows the architectures used in the cloud today. Server virtualization allows guest operating systems to run on top of a host operating system. The applications and the libraries run on top of the guest OS and are isolated from other operating systems. The trend has shifted towards containers.
Providers have been using the isolate functions architecture, in which the functions are executed on a bare-metal server.

1) Executing Lambda Functions on SmartNICs: Recent efforts have explored the potential of executing Lambda functions on SmartNICs. Choi et al. [204] proposed λ-NIC, a framework where Lambda functions are executed on the SmartNIC. It provides a programming abstraction, which resembles the match-action model of the P4 language, to express the lambda functions. The framework analyzes the memory accesses of the functions to map them across the memory hierarchy of the SmartNIC. Because the workloads are short-lived, λ-NIC assigns a function to a single core on the SmartNIC. The system was implemented on a Netronome Agilio CX, and the results show that λ-NIC can decrease the average latency by 880x and improve the throughput by 736x.

Tootaghaj et al. [22] proposed SpikeOffload, a system that offloads serverless functions to the CPU cores of the SmartNICs in the presence of transient traffic spikes, see Fig. 45. A workload collector module gathers the history of workloads and feeds the summary to the workload manager module. The workload manager module then predicts workload spikes based on this history.
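As a rough illustration of such a prediction step, the sketch below is a generic moving-average spike detector with arbitrary thresholds (it is not SpikeOffload's actual model): it flags an imminent spike when recent demand exceeds the historical average by a margin, which would trigger diverting function invocations to the SmartNIC cores.

```python
from collections import deque

class SpikeDetector:
    """Flag a traffic spike when the current rate exceeds the moving average by a factor."""

    def __init__(self, window=60, factor=2.0):
        self.history = deque(maxlen=window)   # recent requests/sec samples
        self.factor = factor

    def observe(self, requests_per_sec):
        spike = (
            len(self.history) == self.history.maxlen
            and requests_per_sec > self.factor * (sum(self.history) / len(self.history))
        )
        self.history.append(requests_per_sec)
        return spike   # True -> divert new invocations to the SmartNIC cores

detector = SpikeDetector(window=5)
for load in [100, 110, 95, 105, 100, 400]:
    print(load, detector.observe(load))
```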
E. Summary and Lessons Learned

• ML workloads, including training and inference, experience significant performance enhancements when offloaded to SmartNICs. These devices efficiently aggregate model updates from multiple ML workers and optimize model parameters. Their programmable pipeline also enables the execution of certain ML models directly for line-rate inference.
• Key-value store operations, which include retrieving and updating data, replicating stores, and detecting failures, can be offloaded to SmartNICs. This would bring notable throughput and latency improvements.
• SmartNICs can be used to schedule transactions, aggregate values, and solve contention in distributed systems, improving the latency and throughput.
• SmartNICs can execute serverless workloads (lambda functions), which reduces the load on the servers. They can also be used as an additional execution engine in a heterogeneous data and compute cluster.

XI. CHALLENGES AND FUTURE TRENDS

In this section, several research and operational challenges that correspond to SmartNICs are outlined. The challenges are discussed below, together with current initiatives and open research directions that address them.
B. Non-optimized P4 Codes

Developers have been using low-level optimizations to enhance the performance of packet processing in SmartNICs. Recently, vendors are embracing P4 as a uniform programming model for SmartNICs [55], [81], [94]. While P4 allows ease of programming and offers a high-level standardized model, it does not guarantee optimal performance on SmartNICs. This is because the P4 compilers are typically optimized for other targets rather than for the specific hardware architecture of each SmartNIC.

Fig. 48. (a) Non-optimized placement; (b) optimized placement. The inter-device transmissions (red arrows) between SmartNIC and CPU lead to additional element graph latency. Reproduced from [211].
Second, although it is technically possible to host switching exclusively on the SmartNIC, doing so incurs considerable latency costs for packets moving between the functions deployed on the SmartNIC and those on the host. This is due to the overhead caused by the multiple traversals across the host PCIe bus. Third, distributing the functions between the host and the SmartNIC introduces management challenges.

Current and Future Initiatives: Le et al. [210] presented UNO, a system that splits switching between the host software and the SmartNIC. It uses a linear programming formulation to determine the optimal placement for functions. UNO uses the traffic pattern and the load of the functions as input. The experiments show that the savings in processors are up to eight host cores. UNO also reduces power consumption by 2x. Another work by Wang et al. [211] optimizes the placement of functions according to the processing and the transmission latency. The system analyzes the dependencies and formulates the partition and placement problem using 0-1 linear programming. The system minimizes the inter-device transmissions between the SmartNIC and the CPU, see Fig. 48.
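The partition-and-placement idea can be illustrated with a toy search. The sketch below exhaustively enumerates host/SmartNIC assignments for a small function chain and counts the PCIe crossings each assignment implies; it is a brute-force stand-in for the 0-1 linear programs used by UNO [210] and Wang et al. [211], and the edge weights, capacity limit, and pinning constraint are made up for illustration.

```python
from itertools import product

# Toy service chain: edges are (upstream_function, downstream_function, traffic_weight).
edges = [("F1", "F2", 10), ("F2", "F3", 10), ("F2", "F4", 4), ("F3", "F5", 10)]
functions = ["F1", "F2", "F3", "F4", "F5"]

def crossings(assignment):
    """Total weighted PCIe crossings for a host(0)/SmartNIC(1) assignment."""
    return sum(w for a, b, w in edges if assignment[a] != assignment[b])

best = None
for bits in product((0, 1), repeat=len(functions)):
    assignment = dict(zip(functions, bits))
    if assignment["F1"] != 1:      # toy constraint: the ingress function stays on the NIC
        continue
    if sum(bits) > 3:              # toy capacity limit: at most 3 functions fit on the NIC
        continue
    cost = crossings(assignment)
    if best is None or cost < best[0]:
        best = (cost, assignment)

print(best)   # placement minimizing inter-device transmissions under the constraints
```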
D. Performance Unpredictability

When offloading a function to SmartNICs, developers must refactor the core logic to align with the underlying hardware. Determining the optimal offloading strategy may not be straightforward. Moreover, the performance of ported functions can vary among developers, relying heavily on their understanding of NIC capabilities, see Fig. 49. For instance, using the flow cache can offer orders of magnitude improvement in latency compared to DRAM [213]. This is entirely related to how the programmer implements the code. The performance is also influenced by traffic workloads (e.g., flow volumes, packet sizes, arrival rates). Additional functions on the SmartNIC can pose further challenges, particularly with memory-intensive functions potentially impacting cache utilization for others, and compute-intensive functions potentially causing head-of-line blocking at accelerators [213]. All these factors often lead to unexpected performance fluctuations when migrating a function to a SmartNIC. While benchmarking the program will produce performance results, it requires that the program be already developed on the SmartNIC.

Fig. 49. Normalized latency for different implementations of functions: Network Address Translation (NAT), DPI, Firewall (FW), LPM, Heavy Hitter (HH). Reproduced from [213].

Current and Future Initiatives: Performance prediction can help the developer gain insight prior to porting the code to the hardware. Clara [213] predicts the performance of an unported function on a hypothetical SmartNIC target. Initially, it constructs a model for a given SmartNIC. Then, it creates performance profiles for that SmartNIC by conducting hardware microbenchmarks, which encompass tests on memory latency, accelerator throughput, etc. Clara then analyzes the code and identifies segments that could be fully offloaded to the SmartNIC. It evaluates the optimal mapping by incorporating constraints derived from the logical NIC model, the performance parameters, and the code segments. By resolving these constraints, Clara can establish a mapping that optimizes performance after porting. Finally, Clara tests with a PCAP file and assesses how packets would traverse the mapping, thereby providing predictions regarding latency and throughput.

E. Poor Security Isolation

Commodity SmartNICs suffer from poor isolation between offloaded functions, and between functions and data center operators [212]. This limitation is a result of the limited access controls on the NIC memory and the absence of virtualization for hardware accelerators. These shortcomings compromise the robustness and security of individual functions, especially in a multi-tenant environment. Additionally, any buggy or compromised code within the NIC poses a risk to all other functions running on it. Concrete attacks on popular SmartNICs, including packet corruption, DPI rule stealing, and I/O bus denial of service, are presented in [212].

Current and Future Initiatives: Zhou et al. [212] proposed S-NIC, a hardware design that enforces disaggregation between resources. S-NIC isolates functions at both the ISA level and the microarchitectural level. This ensures integrity and confidentiality, as well as mitigating side-channel attacks. The design is cost-effective and requires minimal changes to the hardware (e.g., die area). However, it still incurs a modest degradation in performance. Future work could explore alternative architectures that have less impact on performance, or other software-based techniques to isolate the resources.

F. Slow Path Bottleneck

Over recent years, there has been a continuous improvement in the performance of packet-processing data planes, leading to their predominant implementation in hardware such as SmartNICs and programmable switches. Yet, there has been a lack of focus on the slow path, the interface between the data
plane and the control plane, which is traditionally considered non-performance critical. The slow path is responsible for handling the subset of traffic that requires special processing (complex control flow, compute, and memory resources). These tasks cannot be executed on the data plane, see Fig. 50. The slow path is executed on CPU cores, whether on the host or on the SmartNIC.

Fig. 50. (a) SDN in theory; (b) SDN in reality. Reproduced from [209].

Lately, the slow path is becoming a major bottleneck, driven by the surge in physical network bandwidth and the increasing complexity of network topologies. Slow-path traffic grows in tandem with user traffic.

Current and Future Initiatives: There is a need to re-evaluate the current approach to balancing the workload distribution between the data plane and the slow path. Zulfiqar et al. [209] articulated the limitations of the current slow path and argued that the solution is to have a domain-specific accelerator for the slow path. A challenge with creating such an accelerator is to design a generic architecture with common primitives that support most of the slow-path use cases. Ideally, the accelerator would have predictable response times, fast table updates, and support for large memory pools. Further, the paper advocates extending the match-action model found in most packet processing devices to match-compute for the slow path.

G. ML Offload Complexity

Offloading ML training or inference from the CPU/GPU to SmartNICs comes with a set of challenges that limit the scalability and innovation of the deployed models:
• Accuracy vs. compatibility tradeoff. Some hardware architectures do not support floating-point numbers and complex operations, which are required by advanced ML models such as neural networks. Workarounds that are proposed to overcome these limitations come at the expense of sacrificing the accuracy of the ML model.
• Restriction on the adopted ML algorithm. Despite the continuous exploration of deploying ML models such as neural networks and decision trees in SmartNICs, a multitude of algorithms, such as Principal Component Analysis (PCA) and Genetic Algorithms, are yet to be explored. Additionally, models that are currently deployed are static, and any update to the model requires temporarily halting the programmable network device until the new model is compiled and pushed.
• Flexibility of aggregate functions. In the context of training ML models, the traditional aggregate functions are 'min', 'max', 'count', 'sum', and 'avg'. However, over time, several approaches started adopting and providing user-defined aggregate functions. Implementing such functions over some hardware architectures used in SmartNICs is not straightforward.

Current and Future Initiatives: Migration of functionality is one technique that can overcome the restrictions on updating the data plane on the fly. For instance, before the programmable network processor is updated, its functionalities are migrated to another device so that network communication is not interrupted. To deal with the lack of support for floating point, approaches such as [217] translate floating-point numbers to integers using quantization (i.e., a fixed-point representation of decimal numbers). Such techniques can also be used in complex neural network models that need to be simplified to fit in the data plane. To reduce communication overhead, Ma et al. [186] compress the parameters (i.e., gradients) before sharing them in the network. Such approaches can enhance network performance, especially when numerous networking devices are cooperating.
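To illustrate the quantization workaround mentioned above, the snippet below is a generic fixed-point example; the scale factor and bit widths are arbitrary choices for illustration, and it is not the specific scheme of [217].

```python
import numpy as np

FRACTIONAL_BITS = 8
SCALE = 1 << FRACTIONAL_BITS          # fixed-point scale factor (2^8)

def to_fixed(weights):
    """Quantize floating-point weights to 16-bit fixed-point integers."""
    return np.clip(np.round(weights * SCALE), -2**15, 2**15 - 1).astype(np.int16)

def from_fixed(fixed):
    """Dequantize back to floating point (used off the data plane)."""
    return fixed.astype(np.float32) / SCALE

w = np.array([0.731, -1.25, 0.0039], dtype=np.float32)
q = to_fixed(w)
print(q, from_fixed(q))   # integer weights usable by ALU-only pipelines, and their values
```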
H. Lack of Training Resources

There is an evident lack of detailed documentation and training resources that adequately cover SmartNIC programming and configuration. While some vendors may provide reference applications, basic documentation, and training courses (e.g., [220]), they often fall short of providing the in-depth explanations and hands-on experience that developers need. This makes it difficult for newcomers to understand the intricacies of SmartNIC development and configuration.

Current and Future Initiatives: To address this issue, it is essential for vendors to invest in creating comprehensive training materials, including detailed documentation, tutorials, and hands-on labs. These resources should cover various aspects of SmartNIC programming and configuration, from basic concepts to advanced techniques. Additionally, vendors could offer interactive online courses or workshops led by experienced instructors to provide personalized guidance and support for learners. Some YouTube channels are posting the latest advances and updates on SmartNICs (e.g., STH [214], SNIA [215], OPI [216]). However, they are still not comprehensive enough to allow a beginner to start experimenting with SmartNICs.

XII. CONCLUSION

The evolution of computing has encountered significant challenges with the end of Moore's Law and Dennard Scaling. The emergence of SmartNICs, which combine various domain-specific processors, represents a pivotal shift towards offloading infrastructure tasks and improving network efficiency. This paper has filled a critical void in the literature by providing a comprehensive survey of SmartNICs, encompassing their evolution, architectures, development environments, and applications. The paper has delineated the wide array of functions offloaded to SmartNICs, spanning network, security, storage, and compute tasks. The paper has also discussed the challenges associated with SmartNIC development and deployment, and pinpointed key research initiatives and trends that could be explored in the future. Evidence suggests that SmartNICs are poised to become integral components of every network infrastructure. Smaller networks, which often lack deep technical expertise, can leverage SmartNICs for offloading routine infrastructure tasks. On the other hand, larger and research-oriented networks, with experienced developers, will leverage SmartNICs for offloading complex tasks that are not well-suited for general-purpose CPUs.

ACKNOWLEDGEMENT

This work is supported by the National Science Foundation (NSF), Office of Advanced Cyberinfrastructure (OAC), under grant numbers 2118311, 2403360, and 2346726.
TABLE XII
ABBREVIATIONS USED IN THIS ARTICLE.

Abbreviation Term
ACL Access Control List
AES Advanced Encryption Standard
ALU Arithmetic Logic Unit
ANOVA Analysis of Variance
API Application Programming Interface
ASIC Application Specific Integrated Circuit
BCC BPF Compiler Collection
BPF Berkeley Packet Filter
CLI Command Line Interface
CMS Count-min Sketch
CPU Central Processing Unit
DNN Deep Neural Network
DIP Dynamic IP
DOCA Data Center-on-a-Chip Architecture
DPDK Data Plane Development Kit
DPI Deep Packet Inspection
DPU Data Processing Unit
DRAM Dynamic Random Access Memory
eBPF Extended Berkeley Packet Filter
ESnet Energy Sciences Network
FPGA Field Programmable Gate Array
GPU Graphics Processing Units
GRE Generic Routing Encapsulation
GUI Graphical User Interface
HDL Hardware Description Language
HPC High Performance Computing
IDE Integrated Development Environment
IDS Intrusion Detection System
IP Internet Protocol
IPDK Infrastructure Programmer Development Kit
IPU Infrastructure Processing Unit
IPS Intrusion Prevention System
IPSec Internet Protocol Security
IT Information Technology
JBOF Just a Bunch of Flash
KPI Key Performance Indicators
kTLS Kernel TLS
LAN Local Area Network
LUT Lookup Table
LPM Longest Prefix Matching
LSB Least Significant Bit
MBR Maximum Bit Rate
ML Machine Learning
NAS Network Attached Storage
NAT Network Address Translation
NFV Network Function Virtualization
NGFW Next-Generation Firewall
NIC Network Interface Card
NLP Natural Language Processing
NVMe Non-Volatile Memory Express
NVMe-oF Non-Volatile Memory Express over Fabric
OFS Open FPGA Stack
OPAE Open Programmable Acceleration Engine
OPI Open Programmable Infrastructure
OS Operating System
OvS Open vSwitch
P4 Programming Protocol-independent Packet Processor
PCIe Peripheral Component Interconnect Express
PISA Protocol Independent Switch Architecture
PMD Poll Mode Driver
PNA Portable NIC Architecture
PSA Portable Switch Architecture
QoS Quality of Service
RAM Random Access Memory
RAN Radio Access Network
RDMA Remote Direct Memory Access
RPC Remote Procedure Call
RSS Receive Side Scaling
RTL Register Transfer Level
SAN Storage Area Network
SDK Software Development Kit
SDN Software Defined Network
SoC System on a Chip
SPAN Switched Port Analyzer
SPDK Storage Performance Development Kit
SSD Solid State Drives
SVM Support Vector Machine
TCP Transmission Control Protocol
TLS Transport Layer Security
TM Traffic Manager
TRNG True Random Number Generator
TSO TCP Segmentation Offload
uBPF Userspace BPF
UE User Equipment
UPF User Plane Function
URL Uniform Resource Locator
VIP Virtual IP
VM Virtual Machine
VPP Vector Packet Processor
VTEP VXLAN Tunnel End Point
VXLAN Virtual Extensible LAN
XDP eXpress Data Path
xPU Auxiliary Processing Unit

REFERENCES

[1] G. Moore, "Cramming more components onto integrated circuits," Proceedings of the IEEE, 1998.
[2] G. Moore, "Progress in digital integrated electronics," in Electron Devices Meeting, 1975.
[3] R. Dennard, F. Gaensslen, H. Yu, V. Rideout, E. Bassous, and A. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-State Circuits, 1974.
[4] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[5] G. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, 1967.
[6] J. Faircloth, "Enterprise applications administration: The definitive guide to implementation and operations," Morgan Kaufmann, 2013.
[7] S. Ibanez, M. Shahbaz, and N. McKeown, "The case for a network fast path to the CPU," in Proceedings of the 18th ACM Workshop on Hot Topics in Networks, 2019.
[8] M. Metz, "SmartNICs and infrastructure acceleration report 2022," AvidThink, 2022.
[9] A. Ageev, M. Foroushani, and A. Kaufmann, "Exploring domain-specific architectures for network protocol processing,"
[10] E. Tell, "A domain specific DSP processor," Institutionen för systemteknik, 2001.
[11] D. Caetano-Anolles, "Hardware - optimizations - SSD - CPU - GPU - FPGA - TPU," gatk, 2022.
[12] G. Elinoff, "Data centers are overloaded. The inventor of FPGAs is swooping in with a "comprehensive" SmartNIC," March 2020.
[13] Google, "Encryption in transit." [Online]. Available: https://tinyurl.com/436vh9jh.
[14] J. Morra, "Is this the future of the SmartNIC?." [Online]. Available: https://tinyurl.com/ydru5bcp.
[15] Microsoft, "Azure SmartNIC." [Online]. Available: https://tinyurl.com/4sj7m7mp.
[16] S. Schweitzer, "Architectures, boards, chips and software," SmartNIC Summit, 2023.
[17] AMD, "AMD collaborates with the energy sciences network on launch of its next-generation, high-performance network to enhance data-intensive science," 2022. [Online]. Available: https://tinyurl.com/ycyb382t.
[18] VMware, "DPU-based acceleration for NSX." [Online]. Available: https://tinyurl.com/238v6j5h.
[19] Palo Alto Networks, "Intelligent traffic offload uses smartnic/dpu for hyperscale security," 2022. [Online]. Available: https://tinyurl.com/d322nda7.
[20] Juniper Networks, "SmartNICs accelerate the new network edge," 2021. [Online]. Available: https://tinyurl.com/2uh6uh7t.
[21] S. Vural, "SmartNICs in telco: benefits and use cases," 2021. [Online]. Available: https://tinyurl.com/8amw8s74.
[22] D. Tootaghaj, A. Mercian, V. Adarsh, M. Sharifian, and P. Sharma, [47] R. Parizotto, B. Coelho, D. Nunes, I. Haque, and A. Schaeffer-
“SmartNICs at edge for transient compute elasticity,” in Proceedings Filho, “Offloading machine learning to programmable data planes: A
of the 3rd International Workshop on Distributed Machine Learning, systematic survey,” ACM Computing Surveys, 2023.
2022. [48] W. Quan, Z. Xu, M. Liu, N. Cheng, G. Liu, D. Gao, H. Zhang, X. Shen,
[23] C. Zheng, X. Hong, D. Ding, S. Vargaftik, Y. Ben-Itzhak, and N. Zil- and W. Zhuang, “AI-driven packet forwarding with programmable data
berman, “In-network machine learning using programmable network plane: A survey,” IEEE Communications Surveys & Tutorials, 2022.
devices: A survey,” IEEE Communications Surveys & Tutorials, 2023. [49] J. Gomez, E. Kfoury, J. Crichigno, and G. Srivastava, “A survey on TCP
[24] I. Baldin, A. Nikolich, J. Griffioen, I. Monga, K.-C. Wang, T. Lehman, enhancements using P4-programmable devices,” Computer Networks,
and P. Ruth, “FABRIC: A national-scale programmable experimental 2022.
network infrastructure,” IEEE Internet Computing, 2019. [50] S. Han, S. Jang, H. Choi, H. Lee, and S. Pack, “Virtualization in
[25] GEANT, “GEANT testbed.” [Online]. Available: https://geant.org/. programmable data plane: A survey and open challenges,” IEEE Open
[26] GEANT, “High-performance flow monitoring using programmable Journal of the Communications Society, 2020.
network interface cards,” 2023. [51] J. Brito, J. Moreno, L. Contreras, M. Alvarez-Campana, and M. Blanco,
[27] E. da Cunha, M. Martinello, C. Dominicini, M. Schwarz, M. Ribeiro, “Programmable data plane applications in 5G and beyond architectures:
E. Borges, I. Brito, J. Bezerra, and M. Barcellos, “FABRIC testbed A systematic review,” Sensors, 2023.
from the eyes of a network researcher,” in Anais do II Workshop de [52] A. Mazloum, E. Kfoury, J. Gomez, and J. Crichigno, “A survey
Testbeds, 2023. on rerouting techniques with P4 programmable data plane switches,”
[28] D. Cerović, V. del Piccolo, A. Amamou, K. Haddadou, and G. Pujolle, Computer Networks, 2023.
“Fast packet processing: A survey,” IEEE Communications Surveys & [53] M. Chiesa, A. Kamisiński, J. Rak, G. Rétvári, and S. Schmid, “A survey
Tutorials, 2018. of fast recovery mechanisms in the data plane,” Authorea Preprints,
[29] E. Freitas, A. de Oliveira, P. do Carmo, D. Sadok, and J. Kelner, “A 2023.
survey on accelerating technologies for fast network packet processing [54] NVIDIA, “NVIDIA Mellanox BlueField-2 data processing unit
in Linux environments,” Computer Communications, 2022. (DPU).” [Online]. Available: https://tinyurl.com/yrky7ee5.
[30] L. Linguaglossa, S. Lange, S. Pontarelli, G. Rétvári, D. Rossi, T. Zin- [55] AMD, “Pensando DSC2-200 distributed services card.” [Online]
ner, R. Bifulco, M. Jarschel, and G. Bianchi, “Survey of performance Available: https://tinyurl.com/yr6eeez6.
acceleration techniques for network function virtualization,” Proceed- [56] AMD, “Xilinx Alveo SN1000 SmartNIC.” [Online]. Available: https:
ings of the IEEE, 2019. //tinyurl.com/pxacmnd9.
[31] X. Fei, F. Liu, Q. Zhang, H. Jin, and H. Hu, “Paving the way for [57] N. McKeown, “Why does the internet need a programmable forwarding
NFV acceleration: A taxonomy, survey and future directions,” ACM plane.” [Online]. Available: https://tinyurl.com/ffajhk9y.
Computing Surveys (CSUR), 2020. [58] J. Xing, Y. Qiu, K.-F. Hsu, S. Sui, K. Manaa, O. Shabtai, Y. Piasetzky,
[32] P. Shantharama, A. Thyagaturu, and M. Reisslein, “Hardware- M. Kadosh, and A. Krishnamurthy, “Unleashing SmartNIC packet
accelerated platforms and infrastructures for network functions: A processing performance in P4,” in Proceedings of the ACM SIGCOMM
survey of enabling technologies and research studies,” IEEE Access, 2023 Conference, 2023.
2020. [59] S. Kanev, J. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-
Y. Wei, and D. Brooks, “Profiling a warehouse-scale computer,” in
[33] M. Vieira, M. Castanho, R. Pacı́fico, E. Santos, E. Júnior, and L. Vieira,
Proceedings of the 42nd Annual International Symposium on Computer
“Fast packet processing with eBPF and XDP: Concepts, code, chal-
Architecture, 2015.
lenges, and applications,” ACM Computing Surveys (CSUR), 2020.
[60] NVIDIA, “ConnectX-5 EN Card.” [Online]. Available: https://tinyurl.
[34] L. Rosa, L. Foschini, and A. Corradi, “Empowering cloud computing
com/nhcf26nr.
with network acceleration: A survey,” IEEE Communications Surveys
[61] NVIDIA, “ConnectX-6 LX 25/50G Ethernet SmartNIC.” [Online].
& Tutorials, 2024.
Available: https://tinyurl.com/4at7npy5.
[35] The Linux Foundation, “DPDK.” [Online]. Available: https://www.dp
[62] NVIDIA, “ConnectX-6 Dx 200G Ethernet SmartNIC.” [Online].
dk.org/.
Available: https://tinyurl.com/2e59ts66.
[36] Ntop Engineering, “PF RING: High-speed packet capture, filtering and [63] NVIDIA, “ConnectX-7 400G Adapters.” [Online]. Available: https:
analysis.” [Online]. Available: https://tinyurl.com/yzwc4t35. //tinyurl.com/hndz6yxm.
[37] T. Marian, K. Lee, and H. Weatherspoon, “Netslices: Scalable multi- [64] Achronix, “Vectorpath accelerator card.” [Online]. Available: https:
core packet processing in user-space,” in Proceedings of the eighth //tinyurl.com/yc7xachz.
ACM/IEEE symposium on Architectures for networking and communi- [65] AMD, “Xilinx Alveo U50 Data Center Accelerator Card.” [Online].
cations systems, 2012. Available: https://tinyurl.com/nhbe4xbd.
[38] L. Rizzo, “Netmap: a novel framework for fast packet I/O,” in 21st [66] AMD, “Xilinx Alveo U55C Data Center Accelerator Cards.” [Online].
USENIX Security Symposium (USENIX Security 12), 2012. Available: https://tinyurl.com/mr4887yw.
[39] E. Kfoury, J. Crichigno, and E. Bou-Harb, “An exhaustive survey [67] AMD, “Alveo U200 and U250 Data Center Accelerator Cards.” [On-
on P4 programmable data plane switches: Taxonomy, applications, line]. Available: https://tinyurl.com/2p9tzav3.
challenges, and future trends,” IEEE Access, 2021. [68] AMD, “Alveo U280 Data Center Accelerator Card.” [Online]. Avail-
[40] F. Hauser, M. Häberle, D. Merling, S. Lindner, V. Gurevich, F. Zeiger, able: https://tinyurl.com/bdfzke7z.
R. Frank, and M. Menth, “A survey on data plane programming with [69] Napatech, “NT200A02 SmartNIC with Link-Capture Software.” [On-
P4: Fundamentals, advances, and applied research,” Journal of Network line]. Available: https://tinyurl.com/y4xbyypy.
and Computer Applications, 2023. [70] Silicom, “Silicom FPGA SmartNIC N501x Series.” [Online]. Avail-
[41] O. Michel, R. Bifulco, G. Retvari, and S. Schmid, “The programmable able: https://tinyurl.com/4s9mwr88.
data plane: Abstractions, architectures, algorithms, and applications,” [71] Silicom, “Silicom N5110A SmartNIC Intel based.” [Online]. Available:
ACM Computing Surveys (CSUR), 2021. https://tinyurl.com/yskzrzah.
[42] E. Kaljic, A. Maric, P. Njemcevic, and M. Hadzialic, “A survey on data [72] Silicom, “FPGA SmartNIC FB2CDG1@AGM39D-2 Intel based.” [On-
plane flexibility and programmability in software-defined networking,” line]. Available: https://tinyurl.com/3rsbur47.
IEEE Access, 2019. [73] Silicom, “FPGA SmartNIC N6010/6011 Intel based.” [Online]. Avail-
[43] W. da Costa Cordeiro, J. Marques, and L. Gaspary, “Data plane able: https://tinyurl.com/3syps38s.
programmability beyond OpenFlow: Opportunities and challenges for [74] Silicom, “FB4XXVG@Z21D TimeSync SmartNIC FPGA Xilinx
network and service operations and management,” Journal of Network based.” [Online]. Available: https://tinyurl.com/4vdbp3jd.
and Systems Management, 2017. [75] NVIDIA, “Mellanox Innova-2 Flex Open Programmable SmartNIC.”
[44] Y. Gao and Z. Wang, “A review of P4 programmable data planes for [Online]. Available: https://tinyurl.com/3wdy3hxd.
network security,” Mobile Information Systems, 2021. [76] AMD, “Pensando Giglio Data Processing Unit.” [Online]. Available:
[45] A. AlSabeh, J. Khoury, E. Kfoury, J. Crichigno, and E. Bou-Harb, “A https://tinyurl.com/yst9b77m.
survey on security applications of P4 programmable switches and a [77] AMD, “Pensando DSC2-100 100G 2p QSFP56 DPU and DSC2-
STRIDE-based vulnerability assessment,” Computer networks, 2022. 25 10/25G 2p SFP56 DPU Distributed Services Cards for VMware
[46] X. Chen, C. Wu, X. Liu, Q. Huang, D. Zhang, H. Zhou, Q. Yang, and vSphere Distributed Services Engine.” [Online]. Available: https:
M. Khan, “Empowering network security with programmable switches: //tinyurl.com/38ax5jkb.
A comprehensive survey,” IEEE Communications Surveys & Tutorials, [78] Asterfusion, “Helium EC2004Y.” [Online]. Available: https://tinyurl.
2023. com/3bkpn6yv.
[79] Asterfusion, “Helium ec2002p.” [Online]. Available: https://tinyurl.co [116] D. Basak, R. Toshniwal, S. Maskalik, and A. Sequeira, “Virtualizing
m/psfr4w6d. networking and security in the cloud,” ACM SIGOPS Operating Sys-
[80] Broadcom, “Stingray PS225 SmartNIC Adapters.” [Online]. Available: tems Review, 2010.
https://tinyurl.com/5f3rpu45. [117] NVIDIA, “DOCA Open vSwitch Layer-4 Firewall.” [Online]. Avail-
[81] Intel, “Infrastructure Processing Unit (Intel IPU) ASIC E2000.” [On- able: https://tinyurl.com/bdfctkaj.
line]. Available: https://tinyurl.com/5d3rbjfb. [118] AMD, “Achieve high throughput: A case study using a Pensando
[82] Marvell, “Marvell LiquidIO III.” [Online]. Available: https://tinyurl.co distributed services card with P4 programmable software-defined net-
m/a7r69vpc. working pipeline.” [Online]. Available: https://tinyurl.com/yj9ttvnh.
[83] Netronome, “Agilio FX 2x10GbE SmartNIC.” [Online]. Available: [119] The Zeek Project, “Zeek, an open source network security monitoring
https://tinyurl.com/28sxth97. tool.” [Online]. Available: https://zeek.org/.
[84] Netronome, “Agilio CX 2x40GbE SmartNIC.” [Online]. Available: [120] The Open Information Security Foundation, “Suricata.” [Online].
https://tinyurl.com/mfpud4pd. Available: https://suricata.io/.
[85] NVIDIA, “NVIDIA BlueField-3 Networking Platform.” [Online]. [121] Cisco, “Snort - network intrusion detection and prevention system.”
Available: https://tinyurl.com/3e5v2xd2. [Online]. Available: https://www.snort.org/.
[86] AMD, “Xilinx Alveo U25N SmartNIC.” [Online]. Available: https: [122] Z. Zhao, H. Sadok, N. Atre, J. Hoe, V. Sekar, and J. Sherry, “Achieving
//tinyurl.com/2dwz7dxe. 100Gbps intrusion prevention on a single server,” in 14th USENIX
[87] AMD, “Alveo U45N Data Center Accelerator Card.” [Online]. Avail- Symposium on Operating Systems Design and Implementation (OSDI
able: https://tinyurl.com/mvtbshy3. 20), 2020.
[88] Intel, “FPGA Product Catalog.” [Online]. Available: https://tinyurl.co [123] J. Chen, X. Zhang, T. Wang, Y. Zhang, T. Chen, J. Chen, M. Xie,
m/ykvxkj3c. and Q. Liu, “Fidas: Fortifying the cloud via comprehensive fpga-based
[89] Napatech, “SmartNIC and IPU Hardware Portfolio.” [Online]. Avail- offloading for intrusion detection: Industrial product,” in Proceedings
able: https://tinyurl.com/yxcbx2p9. of the 49th Annual International Symposium on Computer Architecture,
[90] M. Liu, T. Cui, H. Schuh, A. Krishnamurthy, S. Peter, and K. Gupta, 2022.
“Offloading distributed applications onto SmartNICs using ipipe,” in [124] Y. Zhao, G. Cheng, Y. Duan, Z. Gu, Y. Zhou, and L. Tang, “Secure IoT
Proceedings of the ACM Special Interest Group on Data Communica- edge: Threat situation awareness based on network traffic,” Computer
tion, 2019. Networks, 2021.
[91] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, [125] S. Panda, Y. Feng, S. Kulkarni, K. Ramakrishnan, N. Duffield, and
C. Schlesinger, D. Talayco, A. Vahdat, and G. Varghese, “P4: Pro- L. Bhuyan, “SmartWatch: Accurate traffic analysis and flow-state
gramming protocol-independent packet processors,” ACM SIGCOMM tracking for intrusion prevention using smartnics,” in Proceedings of the
Computer Communication Review, 2014. 17th International Conference on Emerging Networking Experiments
[92] The P4 Language Consortium, “P4 14 language specification.” [On- and Technologies, 2021.
line]. Available: https://tinyurl.com/hzujjzt7. [126] M. Wu, H. Matsutani, and M. Kondo, “ONLAD-IDS: ONLAD-based
[93] The P4 Language Consortium, “P4 16 language specification.” [On- intrusion detection system using SmartNIC,” in 2022 IEEE 24th Int
line]. Available: https://tinyurl.com/5fvfnd8t. Conf on High Performance Computing & Communications, 2022.
[94] “P4 Portable NIC Architecture (PNA).” [Online]. Available: https:
[127] K. Tasdemir, R. Khan, F. Siddiqui, S. Sezer, F. Kurugollu, and A. Bolat,
//tinyurl.com/3v6etke2.
“An investigation of machine learning algorithms for high-bandwidth
[95] AMD, “Xilinx Vivado Design Suite 2023.” [Online]. Available: https:
SQL injection detection utilising BlueField-3 DPU technology,” in
//www.xilinx.com.
2023 IEEE 36th International System-on-Chip Conference (SOCC),
[96] Intel, “Intel P4 Suite for FPGA.” [Online]. Available: https://tinyurl.
2023.
com/42rztah2.
[128] S. Miano, R. Doriguzzi-Corin, F. Risso, D. Siracusa, and R. Sommese,
[97] The Linux Foundation, “DPDK Supported Hardware.” [Online].
“Introducing SmartNICs in server-based data plane processing: The
Available: https://core.dpdk.org/supported/.
DDoS mitigation use case,” IEEE Access, 2019.
[98] The Linux Foundation, “DPDK Pipeline Application.” [Online].
Available: https://tinyurl.com/udutp3jf. [129] The Open Information Security Foundation, “Ignoring traffic.” [On-
[99] The Linux Foundation, “Generic flow API (rte flow) documentation.” line]. Available: https://tinyurl.com/f2kn3snm.
[Online]. Available: https://tinyurl.com/3pwwnnx2. [130] M. Gonen, “Accelerating the Suricata IDS/IPS with NVIDIA BlueField
[100] S. Horman, “OvS hardware offload with TC flower,” in Proceedings DPUs.” [Online]. Available: https://tinyurl.com/ys8n6mmz.
Open vSwitch 2017 Fall Conf. [131] R. Yavatkar, “SmartNICs accelerate the new network edge.” [Online].
[101] NVIDIA, “DOCA Flow.” [Online]. Available: https://tinyurl.com/bdfx Available: https://tinyurl.com/2af6yfp3.
7u98. [132] M. Ceška, V. Havlena, L. Holı́k, J. Korenek, O. Lengál, D. Matoušek,
[102] NVIDIA, “Mellanox ASAP2 Accelerated Switching and Packet Pro- J. Matoušek, J. Semric, and T. Vojnar, “Deep packet inspection in
cessing,” ConnectX and ASAP2 -Accelerated Switcha and Packet Pro- FPGAs via approximate nondeterministic automata,” in 2019 IEEE
cessing, 2019. 27th Annual International Symposium on Field-Programmable Custom
[103] NVIDIA, “DOCA Developer Guide.” [Online]. Available: https://tiny Computing Machines (FCCM), 2019.
url.com/2usa47hs. [133] Y. Yang and V. Prasanna, “High-performance and compact architecture
[104] Intel, “P4 insight.” [Online]. Available: https://tinyurl.com/2v2xajrf. for regular expression matching on FPGA,” IEEE Transactions on
[105] AMD, “Xilinx Vitis Networking P4.” [Online]. Available: https://tiny Computers, 2011.
url.com/bdctjc9b. [134] D. Matoušek, J. Kořenek, and V. Puš, “High-speed regular expression
[106] AMD, “Xilinx XRT and Vitis Platform Overview.” [Online]. Available: matching with pipelined automata,” in 2016 International Conference
https://tinyurl.com/y5jdsypx. on Field-Programmable Technology (FPT), 2016.
[107] Intel, “Intel Open FPGA Stack.” [Online]. Available: https://www.inte [135] D. Luchaup, L. De Carli, S. Jha, and E. Bach, “Deep packet inspection
l.com/. with DFA-trees and parametrized language overapproximation,” in
[108] The Linux Foundation, “Open Programmable Infrastructure Project.” IEEE INFOCOM 2014-IEEE Conference on Computer Communica-
[Online]. Available: https://opiproject.org/. tions, 2014.
[109] The Linux Foundation, “IPDK Documentation.” [Online]. Available: [136] M. Češka, V. Havlena, L. Holı́k, O. Lengál, and T. Vojnar, “Approx-
https://ipdk.io/documentation/. imate reduction of finite automata for high-speed network intrusion
[110] The Linux Foundation, “Sonic-dash.” [Online]. Available: https://tiny detection,” International Journal on Software Tools for Technology
url.com/utcjchme. Transfer, 2020.
[111] L. Xin, “SONiC, Programmability & Acceleration,” 2022. [Online]. [137] N. Diamond, S. Graham, and G. Clark, “Securing InfiniBand traffic
Available: https://tinyurl.com/musxey96. with BlueField-2 data processing units,” in International Conference
[112] J. Thönes, “Microservices,” IEEE software, 2015. on Critical Infrastructure Protection, 2022.
[113] T. Benson, A. Akella, and D. Maltz, “Network traffic characteristics of [138] Q. Su, S. Wu, Z. Niu, R. Shu, P. Cheng, Y. Xiong, C. Xue, Z. Liu, and
data centers in the wild,” in Proceedings of the 10th ACM SIGCOMM H. Xu, “Meili: Enabling SmartNIC as a service in the cloud,” arXiv
conference on Internet measurement, 2010. preprint arXiv:2312.11871, 2023.
[114] Cisco, “Cisco global cloud index 2015–2020.” [Online]. Available: [139] T. T. Bar Tuaf, Tal Gilboa, “kTLS offload performance enhancements
https://tinyurl.com/2ery68x4. for real-life applications,” 2020. [Online]. Available: https://tinyurl.co
[115] V. Stafford, “Zero trust architecture,” NIST special publication, 2020. m/24ep7pwc.
[140] D. Kim, S. Lee, and K. Park, “A case for smartnic-accelerated private [169] R. Durner, A. Varasteh, M. Stephan, C. Machuca, and W. Kellerer,
Elie Kfoury received the Ph.D. degree in Informatics from the University of South Carolina (USC) in 2023. He is currently an assistant professor in the Integrated Information Technology department at USC. As a member of the Cyberinfrastructure Laboratory, he developed training materials using virtual labs on high-speed networks, TCP congestion control, programmable switches, SDN, and cybersecurity. He is the co-author of the book “High-Speed Networks: A Tutorial”, which is being used nationally for deploying, troubleshooting, and tuning Science DMZ networks. His research interests include P4 programmable data planes, computer networks, cybersecurity, and Blockchain. He previously worked as a research and teaching assistant in the computer science department at the American University of Science and Technology in Beirut.
Samia Choueiri is a Ph.D. student in the College of Engineering and Computing at the University of South Carolina (USC). Her research interests include SmartNICs, P4 switches, cybersecurity, and robotics. She received her Master's in Computer and Communications Engineering with an emphasis in Mechatronics Engineering from the American University of Science and Technology in Beirut, where she was also a teaching assistant and lab instructor.
Ali Mazloum is a Ph.D. student in the College of Engineering and Computing at the University of South Carolina (USC) in the United States of America. Prior to joining USC, he received his bachelor's degree in computer science from the American University of Beirut (AUB). His research focuses on P4 programmable data planes, SmartNICs, cybersecurity, network measurements, and traffic engineering.
Ali AlSabeh is currently a Ph.D. student in the College of Engineering and Computing at the University of South Carolina, USA. He is a member of the CyberInfrastructure Lab (CI Lab), where he developed training materials for virtual labs on network protocols (BGP, OSPF) and their applications (BGP attributes, BGP hijacking, IP spoofing, etc.), as well as SDN (OpenFlow, interconnecting SDN with legacy networks, etc.). He previously earned his M.S. degree in Computer Science from the American University of Beirut, where he also worked as a graduate research assistant and teaching assistant. His area of research focuses on malware analysis, network security, and P4 programmable switches.