sk_buff) that carries packet data through every layer, the network device (net_device) that represents a physical or virtual interface, and the NAPI poll mechanism that batches interrupt-driven packet processing for high throughput.
- Socket Buffers: sk_buff allocation, manipulation, and lifecycle
- Network Devices: net_device registration and net_device_ops
- NAPI: interrupt-driven poll for high-throughput Rx/Tx
Socket Buffer (sk_buff)
struct sk_buff (conventionally named skb in kernel code) is the fundamental metadata structure for every packet in the Linux kernel. It does not itself hold packet data; instead it describes where the data lives via four pointers into a contiguous linear buffer.
sk_buff.head is the start of the buffer. sk_buff.data points to the first byte of valid packet data. sk_buff.tail marks the end of valid data. sk_buff.end marks the end of the linear data area; struct skb_shared_info is stored immediately after it.
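The pointer layout can be made concrete with a small userspace model. The sketch below is not kernel code: struct model_skb and the model_* helpers are hypothetical names invented for illustration, the model covers only the linear buffer (no fragments, no skb_shared_info), and it uses assert() where the kernel would panic on an overrun. The helpers mirror the arithmetic that skb_reserve(), skb_put(), skb_push(), and skb_pull() perform on data, tail, and len.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the linear part of an sk_buff. */
struct model_skb {
    unsigned char *head; /* start of the buffer, never moves */
    unsigned char *data; /* first byte of valid packet data */
    unsigned char *tail; /* one past the last byte of valid data */
    unsigned char *end;  /* end of the linear data area */
    size_t len;          /* tail - data (this model has no fragments) */
};

/* Allocate an empty buffer: data and tail both start at head. */
static int model_alloc(struct model_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->head;
    skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

static void model_free(struct model_skb *skb)
{
    free(skb->head);
    skb->head = skb->data = skb->tail = skb->end = NULL;
    skb->len = 0;
}

/* Space available for prepending headers. */
static size_t model_headroom(const struct model_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* Space available for appending data. */
static size_t model_tailroom(const struct model_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Like skb_reserve(): shift data and tail forward on an empty buffer. */
static void model_reserve(struct model_skb *skb, size_t n)
{
    assert(skb->len == 0 && n <= model_tailroom(skb));
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put(): extend the tail, return the start of the new region. */
static unsigned char *model_put(struct model_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= model_tailroom(skb));
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push(): move data back into the headroom, return the new data. */
static unsigned char *model_push(struct model_skb *skb, size_t n)
{
    assert(n <= model_headroom(skb));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull(): strip n bytes from the front, or NULL if n > len. */
static unsigned char *model_pull(struct model_skb *skb, size_t n)
{
    if (n > skb->len)
        return NULL;
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

With a 128-byte buffer, reserving 32 bytes leaves 32 bytes of headroom and 96 bytes of tailroom. Appending a 10-byte payload and then pushing a 14-byte header gives len = 24 with data at head + 18.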
Key sk_buff Fields
Data pointers
- head: Start of the allocated buffer. Never moved after allocation. The space between head and data is headroom, available for prepending protocol headers with skb_push().
- data: Pointer to the first byte of current packet data. Advanced by skb_pull() on receive and retracted by skb_push() on transmit.
- tail: Byte offset from head (on 64-bit) or pointer (on 32-bit) to the end of valid data. Extended by skb_put() when appending data.
- end: End of the main buffer, stored the same way as tail. struct skb_shared_info begins at this position. Use skb_end_pointer() to obtain the address.
Length fields
- len: Total length of packet data, including data in fragments (data_len). Always equals tail - data plus data_len.
- data_len: Length of data held in skb_shared_info page fragments and frag_list. Zero for packets with all data in the linear buffer.
- mac_len: Length of the link-layer (MAC) header.
- truesize: Total memory consumption of this skb, including the sk_buff struct itself, linear data, and all fragments. Used for socket memory accounting.
Protocol and device fields
- protocol: Packet protocol as seen by the driver (e.g., ETH_P_IP, ETH_P_IPV6), in network byte order. Set by the driver before calling netif_receive_skb().
- dev: The network device this skb arrived on or is being sent out of. May be NULL in some protocol-internal paths.
- sk: The socket that owns this buffer (set when the packet is associated with a socket, e.g., during receive demuxing).
- ip_summed: Checksum status. One of CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_COMPLETE, or CHECKSUM_PARTIAL. Drivers advertising NETIF_F_RXCSUM set CHECKSUM_UNNECESSARY for verified packets.
- pkt_type: Packet class: PACKET_HOST (addressed to this host), PACKET_BROADCAST, PACKET_MULTICAST, PACKET_OTHERHOST.
- hash: Flow hash, used for RSS and load-balancing across queues.
Checksum and GSO fields
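- csum: Checksum value. Interpretation depends on ip_summed: with CHECKSUM_COMPLETE it holds the checksum the device computed over the packet. With CHECKSUM_PARTIAL the same storage is reused for csum_start and csum_offset, and the pseudo-header checksum sits in the transport header's checksum field.
- csum_start: Offset from skb->head at which checksum computation begins (used with CHECKSUM_PARTIAL).
- csum_offset: Offset from csum_start at which the computed checksum is to be stored.
Allocation and Deallocation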
- alloc_skb
- dev_alloc_skb
- kfree_skb / consume_skb
alloc_skb(size, priority) allocates an sk_buff with a linear data buffer of size bytes. The priority argument is passed to the kernel memory allocator (e.g., GFP_KERNEL, GFP_ATOMIC). Use GFP_ATOMIC in interrupt context or when holding a spinlock.
- size: Number of bytes to allocate for the linear data buffer. The actual allocation is rounded up to SMP_CACHE_BYTES alignment. Does not include struct skb_shared_info, which is appended automatically.
- priority: GFP allocation flags. Use GFP_ATOMIC in non-sleepable contexts (interrupt handlers, softirq). Use GFP_KERNEL when it is safe to sleep.
Returns a pointer to the new sk_buff on success, or NULL on allocation failure.
Buffer Manipulation
These functions adjust the data and tail pointers to add or remove protocol headers as the packet traverses the stack:
skb_reserve: allocate headroom
Advances skb->data and skb->tail by len bytes before any data is placed in the buffer. Must be called on a freshly allocated, empty skb. Creates headroom for protocol headers that will be prepended later with skb_push().
skb_put: append data at the tail
Reserves len bytes at the tail, advancing skb->tail and increasing skb->len. Returns a pointer to the start of the newly added region. skb_put_zero() also zero-initializes the region, and skb_put_data() copies len bytes from data into it.
skb_push: prepend a header
Moves skb->data back by len bytes, exposing len bytes of headroom for a new protocol header. Increases skb->len. Returns the new skb->data. Used by each protocol layer to prepend its header as the packet travels down the stack toward the driver.
skb_pull: strip a header
Advances skb->data by len bytes, effectively stripping len bytes from the front of the packet. Decreases skb->len. Returns the new skb->data, or NULL if len exceeds skb->len. Used on the receive path as each layer consumes its header.
Network Device (net_device)
struct net_device represents every network interface visible to the kernel — physical NICs, virtual interfaces (loopback, VLANs, bridges), and tunnel endpoints. Drivers register a net_device to attach to the networking stack.
Registration
Allocate a net_device
alloc_etherdev() is equivalent to alloc_netdev() with ether_setup as the setup callback, which fills in Ethernet defaults (MTU=1500, type=ARPHRD_ETHER, addr_len=6, and standard features).
Register the device
Once register_netdev() succeeds, the device is visible to user space (e.g., through ip link) and can be brought up with IFF_UP. Returns 0 on success, negative errno on failure.
register_netdev() must be called only after all fields and callbacks are fully initialized. The device may begin receiving traffic immediately upon return.
net_device_ops
The net_device_ops structure contains the driver callbacks that implement network device behavior. Set dev->netdev_ops to a const struct net_device_ops before calling register_netdev().
Transmit callbacks
- ndo_start_xmit: Required. Called by the kernel to transmit a packet. The driver takes ownership of the skb and must either hand it to hardware and call dev_consume_skb_any() on completion, or drop it with dev_kfree_skb_any(). Must return NETDEV_TX_OK or NETDEV_TX_BUSY; when returning NETDEV_TX_BUSY the driver must not free the skb, because the stack will requeue it.
- ndo_select_queue: Optional. Select which transmit queue to use for a given skb. If absent, the kernel uses its default multiqueue selection algorithm based on the skb's flow hash.
- ndo_tx_timeout: Called by the watchdog when a transmit queue has been stopped for longer than dev->watchdog_timeo jiffies. The driver should reset the hardware and restart the queue.
State and configuration callbacks
- ndo_open: Called when the interface is brought up (IFF_UP). The driver should allocate DMA rings, enable interrupts, start the hardware, and call netif_start_queue() or netif_tx_start_all_queues().
- ndo_stop: Called when the interface is brought down. Disable interrupts, stop DMA, and drain queues. The NAPI instance must be disabled here with napi_disable() before freeing any rings.
- ndo_get_stats64: Fill in 64-bit per-interface statistics. Preferred over the deprecated ndo_get_stats for all new drivers.
- ndo_set_mac_address: Optional. Change the device MAC address. The argument is a struct sockaddr *. Return -EADDRNOTAVAIL if the address is invalid for the device type.
- ndo_set_rx_mode: Update the device's multicast filter list and promiscuous mode based on dev->flags and dev->mc.
Queue Control
NAPI (New API)
NAPI is the event-handling mechanism for packet reception in the Linux network stack. Instead of processing each packet in a separate hardware interrupt, NAPI coalesces interrupt-driven notifications into batched software polls, which reduces per-packet interrupt overhead at high packet rates.
Operating model:
- A hardware interrupt fires indicating new packets are available.
- The driver masks the interrupt and calls napi_schedule().
- The kernel invokes the driver's poll method in softirq context.
- The poll method processes up to budget packets. If it processed fewer than budget, it calls napi_complete_done() and re-enables the interrupt; otherwise it stays scheduled and is polled again.
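The budget rule in the operating model can be illustrated with a userspace simulation. This is a sketch under stated assumptions, not driver code: struct model_queue and the model_* functions are hypothetical, the "ring" is just a packet counter, and completing the poll is modeled by clearing a flag where a real driver would call napi_complete_done() and unmask its interrupt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one Rx queue; not kernel code. */
struct model_queue {
    int pending;      /* packets waiting in the simulated Rx ring */
    bool irq_enabled; /* true while the device interrupt is unmasked */
    bool scheduled;   /* true while a poll is pending */
};

/* Interrupt handler: mask the interrupt, then schedule polling. */
static void model_irq(struct model_queue *q)
{
    if (!q->irq_enabled)
        return; /* masked interrupts do not fire */
    q->irq_enabled = false;
    q->scheduled = true;
}

/* Poll callback: process at most budget packets and return the work done. */
static int model_poll(struct model_queue *q, int budget)
{
    int work = 0;

    while (work < budget && q->pending > 0) {
        q->pending--; /* stand-in for handing one skb to the stack */
        work++;
    }

    if (work < budget) {
        /* Ring drained: stand-in for napi_complete_done() plus unmask. */
        q->scheduled = false;
        q->irq_enabled = true;
    }
    /* work == budget: stay scheduled so the poll runs again. */
    return work;
}

/* Stand-in for the softirq loop: poll until the instance completes. */
static int model_run(struct model_queue *q, int budget)
{
    int polls = 0;

    while (q->scheduled) {
        model_poll(q, budget);
        polls++;
    }
    return polls;
}
```

With 150 pending packets and a budget of 64, the model needs three polls (64, 64, and 22 packets) before the interrupt is re-enabled. A queue holding exactly 64 packets needs two polls, because a poll that consumes its full budget cannot tell whether the ring is empty.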
napi_struct
Drivers typically keep one napi_struct per hardware receive queue in their private data structure.
Control API
Add a NAPI instance
netif_napi_add() attaches a NAPI instance to a net_device. The poll callback is called with a budget (usually 64 for the default weight) indicating the maximum number of Rx packets to process per invocation. Newly added instances start in the disabled state.
- dev: The network device this NAPI instance belongs to. The instance is automatically removed when dev is unregistered.
- napi: Pointer to the napi_struct embedded in the driver's per-queue private data.
- poll: The driver's poll callback. Must process at most budget receive packets, call napi_complete_done() only if it processed fewer than budget, and return the actual number of packets processed.
- weight: Desired budget. Typically NAPI_POLL_WEIGHT (64). Use lower values for slow devices. Must not exceed NAPI_POLL_WEIGHT.
Enable before opening the device
Call napi_enable() from ndo_open() before enabling hardware interrupts. A disabled NAPI instance cannot be scheduled; any call to napi_schedule() while disabled is a no-op.
Schedule from interrupt handler
Call napi_schedule() from the hardware interrupt handler to queue the NAPI instance for polling. The usual pattern is to mask the device's Rx interrupt first and then schedule the poll.
Packet Reception Path
On the way from hardware interrupt to socket receive buffer, the driver hands each skb to the stack through netif_receive_skb() or napi_gro_receive().
netif_receive_skb() delivers the skb to the protocol handlers registered for skb->protocol. It runs through generic XDP (if loaded), TC ingress, and netfilter hooks, and then dispatches to the appropriate L3 handler (e.g., ip_rcv() for IPv4).
napi_gro_receive() first passes the skb through GRO, which may merge it with other in-flight skbs of the same flow into a single larger skb before delivering to the stack. This reduces per-packet overhead for TCP workloads.
Before calling either function, set skb->protocol (e.g., with eth_type_trans()), skb->dev, and configure skb->ip_summed if the device performed checksum offload.
Packet Transmission Path
Packets enter the transmit path through the socket layer and travel down to the driver's ndo_start_xmit.
