The Linux networking stack is built around three central abstractions: the socket buffer (sk_buff) that carries packet data through every layer, the network device (net_device) that represents a physical or virtual interface, and the NAPI poll mechanism that batches interrupt-driven packet processing for high throughput.

Socket Buffers: sk_buff allocation, manipulation, and lifecycle
Network Devices: net_device registration and net_device_ops
NAPI: interrupt-driven poll for high-throughput Rx/Tx

Socket Buffer (sk_buff)

struct sk_buff (always referred to as skb) is the fundamental metadata structure for every packet in the Linux kernel. It does not itself hold packet data; instead it describes where the data lives via four pointers into a contiguous buffer.
                              ---------------
                             | sk_buff       |
                              ---------------
    ,---------------------------  + head
   /          ,-----------------  + data
  /          /      ,-----------  + tail
 |          |      |            , + end
 |          |      |           |
 v          v      v           v
  -----------------------------------------------
 | headroom | data |  tailroom | skb_shared_info |
  -----------------------------------------------
sk_buff.head is the start of the allocation. sk_buff.data points to the first byte of valid packet data. sk_buff.tail marks the end of valid data. sk_buff.end is the end of the entire allocation, immediately followed by struct skb_shared_info.

Key sk_buff Fields

head
unsigned char *
Start of the allocated buffer. Never moved after allocation. The space between head and data is headroom, available for prepending protocol headers with skb_push().
data
unsigned char *
Pointer to the first byte of current packet data. Advanced by skb_pull() on receive and retracted by skb_push() on transmit.
tail
sk_buff_data_t
Byte offset (on 64-bit) or pointer (on 32-bit) to the end of valid data. Extended by skb_put() when appending data.
end
sk_buff_data_t
End of the main buffer. The struct skb_shared_info is stored immediately at this offset. Use skb_end_pointer() to obtain the address.
len
unsigned int
Total length of packet data, including data in fragments (data_len). Always equals tail - data plus data_len.
data_len
unsigned int
Length of data held in skb_shared_info page fragments and frag_list. Zero for packets with all data in the linear buffer.
mac_len
__u16
Length of the link-layer (MAC) header.
truesize
unsigned int
Total memory consumption of this skb, including the sk_buff struct itself, linear data, and all fragments. Used for socket memory accounting.
protocol
__be16
Packet protocol as seen by the driver (e.g., ETH_P_IP, ETH_P_IPV6). Set by the driver before calling netif_receive_skb().
dev
struct net_device *
The network device this skb arrived on or is being sent out of. May be NULL in some protocol-internal paths.
sk
struct sock *
The socket that owns this buffer (set when the packet is associated with a socket, e.g., during receive demuxing).
ip_summed
__u8
Checksum status. One of CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_COMPLETE, or CHECKSUM_PARTIAL. Drivers advertising NETIF_F_RXCSUM set CHECKSUM_UNNECESSARY for verified packets.
pkt_type
__u8
Packet class: PACKET_HOST (addressed to this host), PACKET_BROADCAST, PACKET_MULTICAST, PACKET_OTHERHOST.
hash
__u32
Flow hash, used for RSS and load-balancing across queues.
csum
__wsum
Checksum value. Interpretation depends on ip_summed: holds the full packet checksum when CHECKSUM_COMPLETE, or the pseudo-header checksum when CHECKSUM_PARTIAL.
csum_start
__u16
Offset from skb->head at which checksum computation begins (used with CHECKSUM_PARTIAL).
csum_offset
__u16
Offset from csum_start at which the computed checksum is to be stored.

Allocation and Deallocation

struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
Allocates an sk_buff with a linear data buffer of size bytes. The priority argument is the set of GFP flags passed to the underlying slab allocator (e.g., GFP_KERNEL, GFP_ATOMIC). Use GFP_ATOMIC in interrupt context or when holding a spinlock.
size
unsigned int
required
Number of bytes to allocate for the linear data buffer. The actual allocation is rounded up to SMP_CACHE_BYTES alignment. Does not include struct skb_shared_info, which is appended automatically.
priority
gfp_t
required
GFP allocation flags. Use GFP_ATOMIC in non-sleepable contexts (interrupt handlers, softirq). Use GFP_KERNEL when it is safe to sleep.
Returns a pointer to the new sk_buff on success, or NULL on allocation failure.

Buffer Manipulation

These functions adjust the data and tail pointers to add or remove protocol headers as the packet traverses the stack:
void skb_reserve(struct sk_buff *skb, int len);
Advances skb->data and skb->tail by len bytes before any data is placed in the buffer. Must be called on a freshly allocated, empty skb. Creates headroom for protocol headers that will be prepended later with skb_push().
struct sk_buff *skb = dev_alloc_skb(MAX_HEADER + sizeof(payload));
skb_reserve(skb, MAX_HEADER);  /* leave room for L2/L3/L4 headers */
memcpy(skb_put(skb, sizeof(payload)), payload, sizeof(payload));
void *skb_put(struct sk_buff *skb, unsigned int len);
void *skb_put_zero(struct sk_buff *skb, unsigned int len);
void *skb_put_data(struct sk_buff *skb, const void *data, unsigned int len);
Extends the data area by len bytes at the tail, advancing skb->tail and increasing skb->len. Returns a pointer to the start of the newly added region. skb_put_zero() zero-initializes the region. skb_put_data() also copies len bytes from data.
skb_put() will BUG() if there is insufficient tailroom. Always verify available tailroom with skb_tailroom() first, or ensure the buffer was allocated with sufficient space.
void *skb_push(struct sk_buff *skb, unsigned int len);
Decrements skb->data by len bytes, exposing len bytes of headroom for a new protocol header. Increases skb->len. Returns the new skb->data. Used by each protocol layer to prepend its header as the packet travels down the stack toward the driver.
/* Prepend an Ethernet header */
struct ethhdr *eth = skb_push(skb, sizeof(*eth));
memcpy(eth->h_dest, dest_mac, ETH_ALEN);
void *skb_pull(struct sk_buff *skb, unsigned int len);
Advances skb->data by len bytes, effectively stripping len bytes from the front of the packet. Decreases skb->len. Returns the new skb->data, or NULL if len exceeds skb->len. Used on the receive path as each layer consumes its header.
/* Pull off the Ethernet header after processing */
if (!skb_pull(skb, sizeof(struct ethhdr)))
    goto drop;

Network Device (net_device)

struct net_device represents every network interface visible to the kernel — physical NICs, virtual interfaces (loopback, VLANs, bridges), and tunnel endpoints. Drivers register a net_device to attach to the networking stack.

Registration

1. Allocate a net_device

struct net_device *alloc_netdev(int sizeof_priv,
                                const char *name,
                                unsigned char name_assign_type,
                                void (*setup)(struct net_device *));

/* Ethernet-specific helper */
struct net_device *alloc_etherdev(int sizeof_priv);
alloc_etherdev() is equivalent to alloc_netdev() with ether_setup as the setup callback, which fills in Ethernet defaults (MTU=1500, type=ARPHRD_ETHER, addr_len=6, and standard features).
2. Fill in device fields and ops

struct my_priv *priv = netdev_priv(dev);
dev->netdev_ops = &my_netdev_ops;
dev->ethtool_ops = &my_ethtool_ops;
dev->features   |= NETIF_F_HW_CSUM | NETIF_F_SG;
SET_NETDEV_DEV(dev, &pdev->dev); /* set parent device for sysfs */
3. Register the device

int register_netdev(struct net_device *dev);
Registers the device with the networking core. On success the interface appears in the system (e.g., visible via ip link) and can be brought up with IFF_UP. Returns 0 on success, negative errno on failure.
register_netdev() must be called only after all fields and callbacks are fully initialized. The device may begin receiving traffic immediately upon return.
4. Unregister on removal

void unregister_netdev(struct net_device *dev);
void free_netdev(struct net_device *dev);
unregister_netdev() detaches the device from the stack and waits for all in-progress operations to complete. Call free_netdev() afterwards to release the net_device allocation. On PCIe driver removal, always unregister before calling pci_iounmap() or freeing DMA resources.

net_device_ops

The net_device_ops structure contains the driver callbacks that implement network device behavior. Set dev->netdev_ops to a const struct net_device_ops before calling register_netdev().
ndo_start_xmit
netdev_tx_t (*)(struct sk_buff *, struct net_device *)
Required. Called by the kernel to transmit a packet. The driver takes ownership of the skb and must either hand it to hardware and call dev_consume_skb_any() on completion, or drop it with dev_kfree_skb_any(). Must return NETDEV_TX_OK or NETDEV_TX_BUSY.
ndo_select_queue
u16 (*)(struct net_device *, struct sk_buff *, struct net_device *)
Optional. Select which transmit queue to use for a given skb. If absent, the kernel uses its default multiqueue selection algorithm based on the skb’s flow hash.
ndo_tx_timeout
void (*)(struct net_device *, unsigned int txqueue)
Called by the watchdog when a transmit queue has been stopped for longer than dev->watchdog_timeo jiffies. The driver should reset the hardware and restart the queue.
ndo_open
int (*)(struct net_device *)
Called when the interface is brought up (IFF_UP). The driver should allocate DMA rings, enable interrupts, start the hardware, and call netif_start_queue() or netif_tx_start_all_queues().
ndo_stop
int (*)(struct net_device *)
Called when the interface is brought down. Disable interrupts, stop DMA, and drain queues. The NAPI instance must be disabled here with napi_disable() before freeing any rings.
ndo_get_stats64
void (*)(struct net_device *, struct rtnl_link_stats64 *)
Fill in 64-bit per-interface statistics. Preferred over the deprecated ndo_get_stats for all new drivers.
ndo_set_mac_address
int (*)(struct net_device *, void *)
Optional. Change the device MAC address. The argument is a struct sockaddr *. Return -EADDRNOTAVAIL if the address is invalid for the device type.
ndo_set_rx_mode
void (*)(struct net_device *)
Update the device’s multicast filter list and promiscuous mode based on dev->flags and dev->mc.

Queue Control

/* Stop a specific transmit queue (e.g., when ring is full) */
void netif_stop_queue(struct net_device *dev);
void netif_tx_stop_queue(struct netdev_queue *txq);

/* Restart a stopped queue (e.g., after ring space freed) */
void netif_wake_queue(struct net_device *dev);
void netif_tx_wake_queue(struct netdev_queue *txq);

/* Start a queue (used in ndo_open, before traffic can flow) */
void netif_start_queue(struct net_device *dev);
void netif_tx_start_all_queues(struct net_device *dev);
netif_stop_queue() and netif_wake_queue() must be called with care from the transmit path. Stopping a queue tells the queueing layer not to call ndo_start_xmit again for that queue. Always pair each stop with a corresponding wake, or the queue stalls permanently.

NAPI (New API)

NAPI is the event-handling mechanism for packet reception in the Linux network stack. Instead of processing each packet in a separate hardware interrupt, NAPI coalesces interrupt-driven notifications into batched software polls, dramatically reducing per-packet interrupt overhead at high packet rates. Operating model:
  1. A hardware interrupt fires indicating new packets are available.
  2. The driver masks the interrupt and calls napi_schedule().
  3. The kernel invokes the driver’s poll method in softirq context.
  4. The poll method processes up to budget packets; when it drains the ring before exhausting the budget, it calls napi_complete_done() and then re-enables the device interrupt.

napi_struct

struct napi_struct {
    struct list_head    poll_list;  /* internal: queued on softirq poll list */
    unsigned long       state;      /* NAPI state bits */
    int                 weight;     /* max packets per poll (budget hint) */
    int (*poll)(struct napi_struct *, int); /* driver poll callback */
    /* ... additional internal fields ... */
};
Drivers embed one napi_struct per hardware receive queue in their private data structure.

Control API

1. Add a NAPI instance

void netif_napi_add(struct net_device *dev,
                    struct napi_struct *napi,
                    int (*poll)(struct napi_struct *, int),
                    int weight);
Registers a new NAPI instance for the given net_device. The poll callback is called with a budget (usually 64 for the default weight) indicating the maximum number of Rx packets to process per invocation. Newly added instances start in the disabled state.
dev
struct net_device *
required
The network device this NAPI instance belongs to. The instance is automatically removed when dev is unregistered.
napi
struct napi_struct *
required
Pointer to the napi_struct embedded in the driver’s per-queue private data.
poll
function pointer
required
The driver’s poll callback. Must process at most budget receive packets, call napi_complete_done() when done, and return the actual number of packets processed.
weight
int
required
Desired budget. Typically NAPI_POLL_WEIGHT (64). Use lower values for slow devices. Must not exceed NAPI_POLL_WEIGHT.
2. Enable before opening the device

void napi_enable(struct napi_struct *napi);
Transitions the NAPI instance from disabled to enabled. Call this in ndo_open() before enabling hardware interrupts. A disabled NAPI instance cannot be scheduled — any call to napi_schedule() while disabled is a no-op.
3. Schedule from the interrupt handler

void napi_schedule(struct napi_struct *napi);
bool napi_schedule_prep(struct napi_struct *napi);
void __napi_schedule(struct napi_struct *napi);
Call napi_schedule() from the hardware interrupt handler to queue the NAPI instance for polling. The pattern is:
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_priv *priv = dev_id;

    /* Mask the interrupt; NAPI poll will re-enable it */
    my_hw_disable_irq(priv);
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}
4. Poll callback

static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_priv *priv = container_of(napi, struct my_priv, napi);
    int work_done = 0;

    while (work_done < budget && my_hw_has_rx_packet(priv)) {
        struct sk_buff *skb = my_hw_receive_packet(priv);
        napi_gro_receive(napi, skb); /* or netif_receive_skb(skb) */
        work_done++;
    }

    if (work_done < budget) {
        /* All pending packets processed: complete and re-enable irq */
        napi_complete_done(napi, work_done);
        my_hw_enable_irq(priv);
    }
    return work_done;
}
If exactly budget packets were processed, do not call napi_complete_done(). The kernel will call poll again immediately. This handles the case where the ring refilled during processing.
5. Disable before closing the device

void napi_disable(struct napi_struct *napi);
Disables the NAPI instance and waits for any in-progress poll to complete. Call in ndo_stop() before freeing DMA rings. napi_disable() is not idempotent — calling it twice in a row deadlocks.

Packet Reception Path

The path from hardware interrupt to socket receive buffer:
/* In the NAPI poll callback, after building the skb: */

/* Option 1: direct delivery (no GRO) */
int netif_receive_skb(struct sk_buff *skb);

/* Option 2: via GRO (Generic Receive Offload) — preferred for Ethernet */
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
netif_receive_skb() delivers the skb to the protocol handlers registered for skb->protocol. It runs through TC ingress, XDP (if loaded), netfilter hooks, and then dispatches to the appropriate L3 handler (e.g., ip_rcv() for IPv4). napi_gro_receive() first passes the skb through GRO, which may merge it with other in-flight skbs of the same flow into a single larger skb before delivering to the stack. This reduces per-packet overhead for TCP workloads.
Before calling either function, set skb->protocol (e.g., with eth_type_trans()), skb->dev, and configure skb->ip_summed if the device performed checksum offload.
/* Typical Ethernet Rx path in a NAPI poll: */
skb->protocol = eth_type_trans(skb, dev); /* strips ETH header, sets protocol */
skb->ip_summed = CHECKSUM_UNNECESSARY;    /* if HW verified the checksum */
napi_gro_receive(napi, skb);

Packet Transmission Path

Packets enter the transmit path through the socket layer and travel down to the driver’s ndo_start_xmit:
/*
 * The kernel calls ndo_start_xmit() with the skb to transmit.
 * The driver must DMA-map, enqueue to HW ring, and return NETDEV_TX_OK.
 * If the ring is full, stop the queue and return NETDEV_TX_BUSY.
 */
static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct my_priv *priv = netdev_priv(dev);

    if (my_ring_full(priv)) {
        netif_stop_queue(dev);
        return NETDEV_TX_BUSY;
    }

    dma_addr_t dma = dma_map_single(&priv->pdev->dev,
                                    skb->data, skb->len, DMA_TO_DEVICE);
    if (dma_mapping_error(&priv->pdev->dev, dma)) {
        dev_kfree_skb_any(skb);  /* drop: DMA mapping failed */
        return NETDEV_TX_OK;
    }
    my_hw_enqueue(priv, dma, skb->len, skb);
    return NETDEV_TX_OK;
}

/* In the Tx completion interrupt / NAPI poll: */
dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
dev_consume_skb_any(skb);  /* packet was successfully sent */
