net/, with driver implementations in drivers/net/.
Socket buffer (sk_buff)
The socket buffer (struct sk_buff, defined in include/linux/skbuff.h) is the fundamental data structure that carries packets through the entire networking stack. A single sk_buff represents one network packet and travels from the driver’s receive ring upward through L2/L3/L4 processing, or downward from the socket layer to transmission.
Headers are added and removed by adjusting the skb’s data and tail pointers, avoiding data copies as the packet traverses layers. Space is reserved at the head for lower-layer headers during transmission (skb_reserve), and headers are pushed down into that space as the packet descends.
The sk_buff uses a reference count (skb->users). Cloning with skb_clone() creates a new sk_buff that shares the data area, while copying with skb_copy() produces a fully independent buffer.

Protocol layers
The stack is divided into discrete processing layers. Each layer hands the sk_buff to the next via a well-defined function call.
- L2 — Link layer
- L3 — Network layer
- L4 — Transport layer
The link layer handles framing and addressing on the local network segment. The kernel’s L2 entry point for received frames is netif_receive_skb() → __netif_receive_skb_core(), which dispatches based on skb->protocol (e.g., ETH_P_IP, ETH_P_IPV6, ETH_P_ARP). Ethernet bridging (net/bridge/) and VLANs operate at this layer.

Netfilter and iptables hooks
Netfilter (net/netfilter/) provides a framework of hooks at key points in the packet path. Kernel modules register callbacks at these hooks to implement firewalling, NAT, connection tracking, and packet mangling.
Rules are configured from user space through a setsockopt-based API (iptables) or a dedicated Netlink family (nftables).
Connection tracking (nf_conntrack) maintains a table of established connections, allowing stateful filtering and NAT session management.
eBPF and XDP
eBPF (extended Berkeley Packet Filter) lets user-supplied programs run safely inside the kernel at various hook points; every program is checked by the in-kernel verifier before it is allowed to execute. XDP (eXpress Data Path) attaches eBPF programs directly to a network driver’s receive path, where they run before the sk_buff is even allocated, achieving near-line-rate packet processing. An XDP program returns one of the following verdicts:
| Action | Meaning |
|---|---|
| XDP_PASS | Hand the packet to the normal network stack |
| XDP_DROP | Drop the packet immediately |
| XDP_TX | Transmit the packet back out the same interface |
| XDP_REDIRECT | Redirect to another interface or CPU queue |
| XDP_ABORTED | Drop and generate a trace event |
Network device drivers and NAPI
Network drivers use the NAPI (New API) interface (include/linux/netdevice.h) to batch packet processing and reduce interrupt overhead at high packet rates.
1. Interrupt fires. The NIC raises a hardware interrupt when packets arrive. The driver’s interrupt handler disables further NIC interrupts and schedules a NAPI poll.
2. NAPI poll loop. The kernel calls the driver’s poll() method in softirq context, allowing it to drain a batch of up to budget packets from the hardware ring buffer.
TCP/IP stack details

TCP send path
Data flows: write(2) → tcp_sendmsg() → TCP segmentation → ip_queue_xmit() → Netfilter POST_ROUTING → device queue → driver TX ring.

TCP maintains a send buffer (sk->sk_sndbuf) and a congestion window (tp->snd_cwnd); the minimum of the two limits how much data can be in flight.
TCP receive path
Packets arrive: driver → netif_receive_skb() → ip_rcv() → tcp_v4_rcv() → socket receive queue → read(2) in user space.

Out-of-order packets are held in the OOO queue (tp->out_of_order_queue) and spliced into the receive buffer once the gap in sequence numbers is filled.
TCP offloads
Modern NICs can offload TCP segmentation (TSO) and checksum computation (TX/RX csum offload), and incoming packets can be coalesced into larger ones (LRO in hardware, GRO in the kernel), dramatically reducing CPU overhead at high throughput.
Network namespaces
Network namespaces (net/core/net_namespace.c) provide isolated network stacks. Each namespace has its own interfaces, routing tables, iptables rules, and socket table. Containers (Docker, Kubernetes pods) rely heavily on network namespaces for isolation.
The initial network namespace is init_net. All network sysctl knobs under /proc/sys/net/ are per-namespace, allowing containers to have independent TCP buffer sizes, forwarding settings, and so on.