Kernel-bypass techniques for high-speed network packet processing
CS 744
Presenters: Rinku Shah, Priyanka Naik
{rinku, ppnaik}@cse.iitb.ac.in
Course Instructor: Prof. Umesh Bellur
Department of Computer Science & Engineering
Indian Institute of Technology Bombay
Outline
● The journey of a packet through the Linux network stack
● Need for kernel bypass techniques for packet processing
● Kernel-bypass techniques
○ User-space packet processing
■ Data Plane Development Kit (DPDK)
■ Netmap
○ User-space network stack
■ mTCP
● What’s trending?
Typical packet flow
[Figure: TX and RX paths traverse the same layers on sender and receiver: Application ↔ Transport (L4) ↔ Network (L3) ↔ Data link (L2) ↔ NIC driver ↔ NIC hardware]
What does a packet contain?
Ethernet header | IP header | TCP header | payload | FCS
● Ethernet header: dest MAC, src MAC, type
● IP header: ... length ... type, checksum, src IP, dst IP ...
● TCP header: src port, dst port, ... checksum ...
FCS: Frame Check Sequence
Outline
● The journey of a packet through the Linux network stack
● Need for kernel bypass techniques for packet processing
● Kernel-bypass techniques
○ User-space packet processing
■ Data Plane Development Kit (DPDK)
■ Netmap
○ User-space network stack
■ mTCP
● What’s trending?
RX path: Packet arrives at the destination NIC
[Figure: NIC hardware DMAs packets into RX ring buffers shared with the NIC driver in kernel space, then raises a hardware interrupt]
NIC receives the packet
● Match destination MAC address
● Verify Ethernet checksum (FCS)
Packets accepted at the NIC
● DMA the packet to the RX ring buffer
● NIC triggers a hardware interrupt
TX/RX rings
● Circular queues
● Shared between the NIC and the NIC driver
● Each entry: length + packet buffer pointer
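As a rough illustration, an RX ring descriptor can be pictured as a small struct. This is a hypothetical, simplified layout for intuition only; real NICs (e.g., Intel ixgbe) define their own hardware-specified descriptor formats:

```c
#include <stdint.h>

/* Hypothetical, simplified RX descriptor. */
struct rx_desc {
    uint64_t buf_addr;   /* physical address of the packet buffer (for DMA) */
    uint16_t length;     /* length of the received frame */
    uint16_t status;     /* e.g., "descriptor done" bit set by the NIC */
};

/* The RX ring is a circular array of such descriptors, shared between
 * the NIC (producer) and the driver (consumer). */
struct rx_ring {
    struct rx_desc *desc;    /* DMA-able array of descriptors */
    uint16_t size;           /* number of descriptors */
    uint16_t next_to_clean;  /* next descriptor the driver will process */
};
```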
Interrupt processing in the Linux kernel
● Top-half
○ Minimal processing
● Bottom-half
○ Rest of interrupt processing
Top-half interrupt processing
[Figure: RX interrupt arrives while the CPU is running application code]
CPU interrupts the process in execution
● Switch from user space to kernel space
Top-half interrupt processing
● Look up the IDT (Interrupt Descriptor Table)
● Call the corresponding ISR (Interrupt Service Routine)
○ Acknowledge the interrupt
○ Schedule bottom-half processing
● Switch back to user space
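A minimal sketch of what a NIC driver’s top half can look like, assuming a NAPI-style driver; the my_nic_* names are hypothetical, while request_irq, napi_schedule, and IRQ_HANDLED are the real kernel APIs:

```c
#include <linux/interrupt.h>
#include <linux/netdevice.h>

/* Hypothetical driver state; real drivers carry much more. */
struct my_nic_adapter {
    struct napi_struct napi;
    void __iomem *regs;
};

/* Top half: do the minimum and defer the rest. Registered earlier with
 * request_irq(irq, my_nic_isr, 0, "my_nic", adapter). */
static irqreturn_t my_nic_isr(int irq, void *dev_id)
{
    struct my_nic_adapter *adapter = dev_id;

    /* 1. Acknowledge/mask the interrupt (device-specific register write). */
    /* 2. Schedule the bottom half: the NAPI poll runs later in softirq
     *    context, outside this hard-interrupt path. */
    napi_schedule(&adapter->napi);

    return IRQ_HANDLED;
}
```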
Bottom-half processing
CPU initiates the bottom half when it is free (softirq)
[Figure: the driver allocates an sk-buff for each packet in the RX ring buffer]
● Switch from user space to kernel space
● Driver dynamically allocates an sk-buff (a.k.a. skb). Oops!! (per-packet dynamic allocation is costly)
sk-buff (sk-buff tutorial link)
● In-memory data structure that contains packet metadata
○ Pointers to packet headers and payload
○ More packet-related information ...
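For orientation, a heavily simplified picture of the sk-buff; the real struct sk_buff in include/linux/skbuff.h has dozens more fields:

```c
/* Heavily simplified view of struct sk_buff, for intuition only. */
struct sk_buff_simplified {
    struct sk_buff_simplified *next, *prev; /* queue linkage (lists of packets) */
    struct net_device *dev;   /* device the packet arrived on / leaves by */
    unsigned int len;         /* total packet length */
    unsigned short protocol;  /* L3 protocol, e.g. ETH_P_IP */
    unsigned char *head;      /* start of the allocated buffer */
    unsigned char *data;      /* current header: moves as layers strip/add headers */
    unsigned char *tail;      /* end of the data */
    unsigned char *end;       /* end of the buffer */
};
```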
Bottom-half processing
[Figure: for each packet in the RX ring buffer, the driver builds an sk-buff and hands it up the stack]
NIC driver processing (repeated for all packets in the buffer):
1. Driver dynamically allocates an sk-buff
2. Update the sk-buff with packet metadata
3. Remove the Ethernet header
4. Pass the sk-buff to the network stack
Call the L3 protocol handler
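A hedged sketch of how these steps look inside a NAPI poll callback; the my_nic_* helpers are hypothetical, while eth_type_trans, napi_gro_receive, and napi_complete_done are the real kernel calls:

```c
#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int my_nic_poll(struct napi_struct *napi, int budget)
{
    int done = 0;

    while (done < budget && my_nic_rx_ready(napi)) {     /* hypothetical helper */
        struct sk_buff *skb = my_nic_build_skb(napi);    /* steps 1-2: alloc + metadata */

        /* Step 3: record the L3 protocol and strip the Ethernet header. */
        skb->protocol = eth_type_trans(skb, napi->dev);

        /* Step 4: hand the packet to the network stack, which will invoke
         * the L3 protocol handler. */
        napi_gro_receive(napi, skb);
        done++;
    }
    if (done < budget)
        napi_complete_done(napi, done);  /* re-enable interrupts when idle */
    return done;
}
```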
L3/L4 processing
[Figure: RX packet climbs the stack: data link (L2) → network (L3) → transport (L4) → application]
Common processing (at each layer)
1. Match destination IP/socket
2. Verify checksum
3. Remove header
L3-specific processing
1. Route lookup
2. Combine fragmented packets
3. Call the L4 protocol handler
L4-specific processing (detailed on the next slide)
L3/L4 processing
[Figure: the sk-buff moves through the network stack into the socket’s read queue]
L3-specific processing
1. Route lookup
2. Combine fragmented packets
3. Call the L4 protocol handler
L4-specific processing
1. Handle the TCP state machine
2. Enqueue to the socket read queue
3. Signal the socket
Application processing
[Figure: the application’s read system call dequeues data from the socket receive queue]
On socket read: switch from user space to kernel space
● Dequeue the packet from the socket receive queue (kernel space)
● Copy the packet to the application buffer (user space)
● Release the sk-buff
● Return to the application (kernel space back to user space)
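From the application’s point of view, this entire receive pipeline hides behind one blocking call. A minimal sketch (port 9000 is an example; error handling omitted):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    char buf[2048];
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(9000),              /* example port */
        .sin_addr   = { .s_addr = INADDR_ANY },
    };

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 16);
    int conn = accept(fd, NULL, NULL);

    /* recv() traps into the kernel, dequeues data from the socket receive
     * queue, and copies it into buf: one kernel-to-user copy per read. */
    ssize_t n = recv(conn, buf, sizeof(buf), 0);
    printf("read %zd bytes\n", n);

    close(conn);
    close(fd);
    return 0;
}
```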
Transmit path of an application packet
[Figure: the application’s write system call enters the kernel network stack]
On socket write: switch from user space to kernel space
● Write the packet to the kernel buffer
● Call the socket’s send function (e.g., sendmsg)
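The mirror image of the receive example: one send() call hides the whole TX pipeline that the next slides walk through. The address and port below are examples:

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello";
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port   = htons(9000),                    /* example port */
    };
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* example address */

    connect(fd, (struct sockaddr *)&dst, sizeof(dst));

    /* send() copies msg into a kernel buffer and invokes the socket's send
     * path (sendmsg under the hood); everything from L4 down to the NIC
     * happens in the kernel after this point. */
    send(fd, msg, strlen(msg), 0);

    close(fd);
    return 0;
}
```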
L4/L3 processing
[Figure: the sk-buff is queued on the socket’s write queue in the network stack]
L4-specific processing
1. Allocate an sk-buff
2. Enqueue the sk-buff to the socket write queue
3. Call the L3 protocol handler
Common processing (at each layer)
1. Build the header
2. Add the header to the packet buffer
3. Update the sk-buff
L3-specific processing
1. Fragment, if needed
2. Call the L2 protocol handler
L2 processing
[Figure: the sk-buff passes through the qdisc queue between the network stack and the NIC driver]
Enqueue the packet to the queue discipline (qdisc)
● Hold packets in a queue
● Apply scheduling policies (e.g., FIFO, priority)
Dequeue
● Dequeue the sk-buff (if the NIC has free buffers)
● Post-process the sk-buff
○ Calculate IP/TCP checksums
○ ... (tasks that the h/w cannot do)
● Call the NIC driver’s send function
NIC processing
[Figure: the driver DMA-maps the packet into the NIC’s hardware TX queue]
NIC driver
● If the hardware transmit queue is full
○ Stop the qdisc queue
● Otherwise:
○ Map the packet data for DMA
○ Tell the NIC to send the packet
NIC
● Calculates the Ethernet frame checksum (FCS)
● Sends the packet on the wire
● Raises a “packet is sent” interrupt
● Driver then frees the sk-buff and restarts the qdisc queue
Transmit and receive packet processing pipelines DONE!!
Packet processing overheads in the kernel
● Too many context switches!!
○ Pollutes CPU cache
● Per-packet interrupt overhead
● Dynamic allocation of sk-buff
● Packet copy between kernel and user space
● Shared data structures
Cannot achieve line rate for recent high-speed NICs (40 Gbps / 100 Gbps)!!
Optimizations to accelerate kernel packet processing
● NAPI (New API) [Reading link]
● GRO (Generic Receive Offload) [GRO+GSO]
● GSO (Generic Segmentation Offload) [GRO+GSO with DPDK]
● Use of multiple hardware queues [Multiqueue NIC; Supplement: RSS+RPS+...]
● ...
Outline
● The journey of a packet through the Linux network stack
● Need for kernel bypass techniques for packet processing
● Kernel-bypass techniques
○ User-space packet processing
■ Data Plane Development Kit (DPDK)
■ Netmap
○ User-space network stack
■ mTCP
● What’s trending?
Packet Processing Overheads in Kernel
[Figure: read() crosses the user/kernel boundary; the packet is copied from a buffer in kernel memory to the application buffer in userspace; sk_buffs are allocated on both the receive and transmit paths]
● Context switch between kernel and userspace
● Packet copy between kernel and userspace
● Dynamic allocation of sk_buff
● Per-packet interrupt
● Shared data structures
Overcome Overheads in Kernel: Bypass the kernel
[Figure: left, the conventional path through the kernel; right, the application performs L2-L4 packet processing in user space over shared, pre-allocated buffers, bypassing the kernel]
Overheads eliminated:
● Context switch between kernel and userspace
● Packet copy between kernel and userspace
● Dynamic allocation of sk_buff
Interrupt vs Poll Mode
Interrupt mode
● NIC notifies the CPU that it needs servicing
● Interrupt is a hardware mechanism
● Handled using an interrupt handler
● Interrupt overhead for high-speed traffic
● Interrupt for a batch of packets
Poll mode
● CPU keeps checking the NIC
● Polling is done with the help of control bits (command-ready bit)
● Handled by the CPU
● Consumes CPU cycles, but handles high-speed traffic
Interrupt vs Poll Mode: Kernel-bypass techniques
● Netmap uses interrupt mode: the application blocks (e.g., in poll()) until the NIC signals work
● DPDK uses poll mode: dedicated cores busy-poll the NIC
Outline
● The journey of a packet through the Linux network stack
● Need for kernel bypass techniques for packet processing
● Kernel-bypass techniques
○ User-space packet processing
■ Data Plane Development Kit (DPDK)
■ Netmap
○ User-space network stack
■ mTCP
● What’s trending?
Intel Data Plane Development Kit (DPDK)
[Figure: DPDK application in user space with rte_mempool, rte_ring, and rte_mbuf, running over poll-mode drivers; the kernel is bypassed]
● Poll-mode user-space drivers (uio)
○ Unbind the NIC from the kernel
● Mempool: huge pages to avoid TLB misses
● rte_mbuf: metadata + packet buffer
● Cooperative multiprocessing
○ Safe only for trusted applications
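A minimal sketch of a DPDK receive loop, assuming port 0 has already been configured and started (setup calls elided); rte_eth_rx_burst and rte_pktmbuf_free are DPDK’s actual APIs:

```c
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_debug.h>

#define BURST_SIZE 32

/* Assumes prior setup: rte_eth_dev_configure, rte_eth_rx_queue_setup with a
 * hugepage-backed rte_mempool, and rte_eth_dev_start (omitted for brevity). */
int main(int argc, char **argv)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    /* ... port/queue/mempool setup elided ... */

    for (;;) {
        /* Busy-poll the NIC: no interrupts, no system calls, no copies. */
        uint16_t nb = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                       bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++) {
            /* rte_pktmbuf_mtod(bufs[i], ...) yields the packet data pointer. */
            rte_pktmbuf_free(bufs[i]);   /* return the mbuf to its mempool */
        }
    }
    return 0;
}
```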
Netmap
[Figure: netmap rings live in kernel memory mapped into the application; the kernel TCP stack and standard drivers (e.g., ixgbe) remain usable alongside the netmap driver]
● Netmap rings are memory regions in kernel space, shared between the application and the kernel
● No extra copy of a packet
● NIC can work with netmap as well as the kernel drivers (transparent mode)
Note: DPDK and netmap manage processing only up to L2 of the network stack
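A sketch of a netmap receive loop under these semantics; "eth0" is an example interface, and the calls (nm_open, NETMAP_RXRING, nm_ring_next) follow netmap’s public netmap_user.h API:

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
    if (d == NULL)
        return 1;

    struct pollfd pfd = { .fd = d->fd, .events = POLLIN };

    for (;;) {
        /* Interrupt mode: sleep until the kernel signals the rings have work. */
        poll(&pfd, 1, -1);

        struct netmap_ring *ring = NETMAP_RXRING(d->nifp, 0);
        while (!nm_ring_empty(ring)) {
            uint32_t i = ring->cur;
            char *buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);

            /* The packet (ring->slot[i].len bytes) sits at buf, inside the
             * kernel/user shared memory region: no copy was made. */
            printf("got %u bytes\n", (unsigned)ring->slot[i].len);

            ring->head = ring->cur = nm_ring_next(ring, i);
        }
    }
    /* not reached */
    nm_close(d);
    return 0;
}
```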
What about L3-L7 processing?
[Figure: application threads on multiple CPU cores contend on shared socket and TCP data structures inside the kernel network stack]
● Overheads of L3-L7 processing in the kernel
○ Shared data structures (sockets, TCP state) across cores
● Userspace network stack
○ Runs over netmap or DPDK
● mTCP: multicore TCP processing
Multiqueue NIC
[Figure: a multiqueue NIC spreads incoming packets across RX/TX queue pairs, each served by a different core and application thread]
● NIC Receive Side Scaling (RSS)
○ Hash of (src_ip, dst_ip, src_port, dst_port) selects the RX queue for each incoming packet (toy version sketched below)
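A toy illustration of the idea; real NICs use a Toeplitz hash with a configurable key, so this hypothetical hash only shows the flow-to-queue mapping:

```c
#include <stdint.h>

struct flow4 {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Toy hash over the 4-tuple; NOT the Toeplitz hash real hardware uses. */
static uint32_t toy_flow_hash(const struct flow4 *f)
{
    uint32_t h = f->src_ip ^ f->dst_ip;
    h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
    return h * 2654435761u;   /* multiplicative mixing */
}

/* All packets of one connection hash to the same queue, hence to the same
 * core: this is the "connection locality" that mTCP relies on. */
static unsigned rss_queue(const struct flow4 *f, unsigned num_queues)
{
    return toy_flow_hash(f) % num_queues;
}
```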
mTCP: Userspace network stack
[Figure: one mTCP thread per core, running over netmap/DPDK; per-core data structures replace the kernel’s shared ones]
● Designed for multicore-scalable applications
● Per-core TCP data structures
○ E.g., accept queue, socket list
○ Lock-free
○ Connection locality
● Leverages the multiqueue (RSS) support of the NIC
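A hedged sketch of a per-core server thread against mTCP’s published API (mtcp_api.h); exact signatures may differ across versions, and error handling is omitted:

```c
#include <sys/socket.h>
#include <mtcp_api.h>   /* mTCP's userspace socket API */

/* Every call takes a per-core context (mctx), so accept queues and socket
 * lists are private to this core: no cross-core locks. */
void *server_thread(void *arg)
{
    int core = *(int *)arg;

    mtcp_core_affinitize(core);                /* pin the thread to its core */
    mctx_t mctx = mtcp_create_context(core);   /* per-core mTCP context */

    int sock = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
    /* mtcp_bind / mtcp_listen / mtcp_accept mirror bind/listen/accept but
     * operate on this core's lock-free data structures; an event loop
     * would use mtcp_epoll_wait here. */
    (void)sock;

    mtcp_destroy_context(mctx);
    return NULL;
}
```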
Outline
● The journey of a packet through the Linux network stack
● Need for kernel bypass techniques for packet processing
● Kernel-bypass techniques
○ User-space packet processing
■ Data Plane Development Kit (DPDK)
■ Netmap
○ User-space network stack
■ mTCP
● What’s trending?
What’s trending?
● Offload application processing to the kernel
○ BPF (Berkeley Packet Filter)
○ eBPF (eXtended BPF) [BPF+eBPF+XDP link-1, BPF+eBPF+XDP tutorial link-2]
● Offload application processing to the NIC driver
○ XDP (eXpress Data Path) [Sample apps for eBPF + XDP] (minimal example below)
● Offload application processing to programmable hardware
○ Programmable SmartNICs (NPU/DPU)
■ Netronome, Mellanox BlueField, Pensando [Video on SmartNIC architecture + Netronome NIC specifics]
○ Programmable FPGAs
■ Xilinx, Altera
○ Programmable hardware ASICs [Programmable networks: intro video, detailed video link]
■ Barefoot Tofino, Cisco’s Doppler, Intel FlexPipe, Cavium’s XPliant
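For flavor, a minimal XDP program: it runs in the NIC driver’s RX path, before any sk_buff is allocated. This one passes every packet up the stack; a real program would parse headers and return XDP_DROP, XDP_TX, or XDP_REDIRECT:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    /* ctx->data .. ctx->data_end bound the raw packet; returning XDP_DROP
     * here instead would drop packets at (near) line rate. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Such a program is compiled with clang -target bpf and attached with, e.g., ip link set dev eth0 xdp obj prog.o sec xdp.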