The Linux networking stack is built around three central abstractions: the socket buffer (sk_buff) that carries packet data through every layer, the network device (net_device) that represents a physical or virtual interface, and the NAPI poll mechanism that batches interrupt-driven packet processing for high throughput.

Socket Buffers: sk_buff allocation, manipulation, and lifecycle
Network Devices: net_device registration and net_device_ops
NAPI: interrupt-driven poll for high-throughput Rx/Tx

Socket Buffer (sk_buff)

struct sk_buff (always referred to as skb) is the fundamental metadata structure for every packet in the Linux kernel. It does not itself hold packet data; instead it describes where the data lives via four pointers into a contiguous buffer.
                              ---------------
                             | sk_buff       |
                              ---------------
    ,---------------------------  + head
   /          ,-----------------  + data
  /          /      ,-----------  + tail
 |          |      |            , + end
 |          |      |           |
 v          v      v           v
  -----------------------------------------------
 | headroom | data |  tailroom | skb_shared_info |
  -----------------------------------------------
sk_buff.head is the start of the allocation. sk_buff.data points to the first byte of valid packet data. sk_buff.tail marks the end of valid data. sk_buff.end is the end of the entire allocation, immediately followed by struct skb_shared_info.

Key sk_buff Fields

head
unsigned char *
Start of the allocated buffer. Never moved after allocation. The space between head and data is headroom, available for prepending protocol headers with skb_push().
data
unsigned char *
Pointer to the first byte of current packet data. Advanced by skb_pull() on receive and retracted by skb_push() on transmit.
tail
sk_buff_data_t
Byte offset (on 64-bit) or pointer (on 32-bit) to the end of valid data. Extended by skb_put() when appending data.
end
sk_buff_data_t
End of the main buffer. The struct skb_shared_info is stored immediately at this offset. Use skb_end_pointer() to obtain the address.
len
unsigned int
Total length of packet data, including data in fragments (data_len). Always equals tail - data plus data_len.
data_len
unsigned int
Length of data held in skb_shared_info page fragments and frag_list. Zero for packets with all data in the linear buffer.
mac_len
__u16
Length of the link-layer (MAC) header.
truesize
unsigned int
Total memory consumption of this skb, including the sk_buff struct itself, linear data, and all fragments. Used for socket memory accounting.
protocol
__be16
Packet protocol as seen by the driver (e.g., ETH_P_IP, ETH_P_IPV6). Set by the driver before calling netif_receive_skb().
dev
struct net_device *
The network device this skb arrived on or is being sent out of. May be NULL in some protocol-internal paths.
sk
struct sock *
The socket that owns this buffer (set when the packet is associated with a socket, e.g., during receive demuxing).
ip_summed
__u8
Checksum status. One of CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_COMPLETE, or CHECKSUM_PARTIAL. Drivers advertising NETIF_F_RXCSUM set CHECKSUM_UNNECESSARY for verified packets.
pkt_type
__u8
Packet class: PACKET_HOST (addressed to this host), PACKET_BROADCAST, PACKET_MULTICAST, PACKET_OTHERHOST.
hash
__u32
Flow hash, used for RSS and load-balancing across queues.
csum
__wsum
Checksum value. Interpretation depends on ip_summed: holds the full packet checksum when CHECKSUM_COMPLETE, or the pseudo-header checksum when CHECKSUM_PARTIAL.
csum_start
__u16
Offset from skb->head at which checksum computation begins (used with CHECKSUM_PARTIAL).
csum_offset
__u16
Offset from csum_start at which the computed checksum is to be stored.

Allocation and Deallocation

struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
Allocates an sk_buff with a linear data buffer of size bytes. The priority argument is the set of GFP flags passed to the underlying slab allocator (e.g., GFP_KERNEL, GFP_ATOMIC). Use GFP_ATOMIC in interrupt context or when holding a spinlock.
size
unsigned int
required
Number of bytes to allocate for the linear data buffer. The actual allocation is rounded up to SMP_CACHE_BYTES alignment. Does not include struct skb_shared_info, which is appended automatically.
priority
gfp_t
required
GFP allocation flags. Use GFP_ATOMIC in non-sleepable contexts (interrupt handlers, softirq). Use GFP_KERNEL when it is safe to sleep.
Returns a pointer to the new sk_buff on success, or NULL on allocation failure.

Buffer Manipulation

These functions adjust the data and tail pointers to add or remove protocol headers as the packet traverses the stack:
void skb_reserve(struct sk_buff *skb, int len);
Advances skb->data and skb->tail by len bytes before any data is placed in the buffer. Must be called on a freshly allocated, empty skb. Creates headroom for protocol headers that will be prepended later with skb_push().
struct sk_buff *skb = dev_alloc_skb(MAX_HEADER + sizeof(payload));
skb_reserve(skb, MAX_HEADER);  /* leave room for L2/L3/L4 headers */
memcpy(skb_put(skb, sizeof(payload)), payload, sizeof(payload));
void *skb_put(struct sk_buff *skb, unsigned int len);
void *skb_put_zero(struct sk_buff *skb, unsigned int len);
void *skb_put_data(struct sk_buff *skb, const void *data, unsigned int len);
Extends the data area by len bytes at the tail, advancing skb->tail and increasing skb->len. Returns a pointer to the start of the newly added region. skb_put_zero() zero-initializes the region. skb_put_data() also copies len bytes from data.
skb_put() will BUG() if there is insufficient tailroom. Always verify available tailroom with skb_tailroom() first, or ensure the buffer was allocated with sufficient space.
void *skb_push(struct sk_buff *skb, unsigned int len);
Decrements skb->data by len bytes, exposing len bytes of headroom for a new protocol header. Increases skb->len. Returns the new skb->data. Used by each protocol layer to prepend its header as the packet travels down the stack toward the driver.
/* Prepend an Ethernet header */
struct ethhdr *eth = skb_push(skb, sizeof(*eth));
memcpy(eth->h_dest, dest_mac, ETH_ALEN);
void *skb_pull(struct sk_buff *skb, unsigned int len);
Advances skb->data by len bytes, effectively stripping len bytes from the front of the packet. Decreases skb->len. Returns the new skb->data, or NULL if len exceeds skb->len. Used on the receive path as each layer consumes its header.
/* Pull off the Ethernet header after processing */
if (!skb_pull(skb, sizeof(struct ethhdr)))
    goto drop;

Network Device (net_device)

struct net_device represents every network interface visible to the kernel — physical NICs, virtual interfaces (loopback, VLANs, bridges), and tunnel endpoints. Drivers register a net_device to attach to the networking stack.

Registration

1. Allocate a net_device

struct net_device *alloc_netdev(int sizeof_priv,
                                const char *name,
                                unsigned char name_assign_type,
                                void (*setup)(struct net_device *));

/* Ethernet-specific helper */
struct net_device *alloc_etherdev(int sizeof_priv);
alloc_etherdev() is equivalent to alloc_netdev() with ether_setup as the setup callback, which fills in Ethernet defaults (MTU=1500, type=ARPHRD_ETHER, addr_len=6, and standard features).
2. Fill in device fields and ops

struct my_priv *priv = netdev_priv(dev);
dev->netdev_ops = &my_netdev_ops;
dev->ethtool_ops = &my_ethtool_ops;
dev->features   |= NETIF_F_HW_CSUM | NETIF_F_SG;
SET_NETDEV_DEV(dev, &pdev->dev); /* set parent device for sysfs */
3. Register the device

int register_netdev(struct net_device *dev);
Registers the device with the networking core. On success the interface appears in the system (e.g., visible via ip link) and can be brought up with IFF_UP. Returns 0 on success, negative errno on failure.
register_netdev() must be called only after all fields and callbacks are fully initialized. The device may begin receiving traffic immediately upon return.
4. Unregister on removal

void unregister_netdev(struct net_device *dev);
void free_netdev(struct net_device *dev);
unregister_netdev() detaches the device from the stack and waits for all in-progress operations to complete. Call free_netdev() afterwards to release the net_device allocation. On PCIe driver removal, always unregister before calling pci_iounmap() or freeing DMA resources.

net_device_ops

The net_device_ops structure contains the driver callbacks that implement network device behavior. Set dev->netdev_ops to a const struct net_device_ops before calling register_netdev().
ndo_start_xmit
netdev_tx_t (*)(struct sk_buff *, struct net_device *)
Required. Called by the kernel to transmit a packet. The driver takes ownership of the skb and must either hand it to hardware and call dev_consume_skb_any() on completion, or drop it with dev_kfree_skb_any(). Must return NETDEV_TX_OK or NETDEV_TX_BUSY.
ndo_select_queue
u16 (*)(struct net_device *, struct sk_buff *, struct net_device *)
Optional. Select which transmit queue to use for a given skb. If absent, the kernel uses its default multiqueue selection algorithm based on the skb’s flow hash.
ndo_tx_timeout
void (*)(struct net_device *, unsigned int txqueue)
Called by the watchdog when a transmit queue has been stopped for longer than dev->watchdog_timeo jiffies. The driver should reset the hardware and restart the queue.
ndo_open
int (*)(struct net_device *)
Called when the interface is brought up (IFF_UP). The driver should allocate DMA rings, enable interrupts, start the hardware, and call netif_start_queue() or netif_tx_start_all_queues().
ndo_stop
int (*)(struct net_device *)
Called when the interface is brought down. Disable interrupts, stop DMA, and drain queues. The NAPI instance must be disabled here with napi_disable() before freeing any rings.
ndo_get_stats64
void (*)(struct net_device *, struct rtnl_link_stats64 *)
Fill in 64-bit per-interface statistics. Preferred over the deprecated ndo_get_stats for all new drivers.
ndo_set_mac_address
int (*)(struct net_device *, void *)
Optional. Change the device MAC address. The argument is a struct sockaddr *. Return -EADDRNOTAVAIL if the address is invalid for the device type.
ndo_set_rx_mode
void (*)(struct net_device *)
Update the device’s multicast filter list and promiscuous mode based on dev->flags and dev->mc.

Queue Control

/* Stop a specific transmit queue (e.g., when ring is full) */
void netif_stop_queue(struct net_device *dev);
void netif_tx_stop_queue(struct netdev_queue *txq);

/* Restart a stopped queue (e.g., after ring space freed) */
void netif_wake_queue(struct net_device *dev);
void netif_tx_wake_queue(struct netdev_queue *txq);

/* Start a queue (used in ndo_open, before traffic can flow) */
void netif_start_queue(struct net_device *dev);
void netif_tx_start_all_queues(struct net_device *dev);
netif_stop_queue() and netif_wake_queue() must be called with care from the transmit path. Stopping a queue tells the queueing layer not to call ndo_start_xmit again for that queue. Always pair each stop with a corresponding wake, or the queue stalls permanently.

NAPI (New API)

NAPI is the event-handling mechanism for packet reception in the Linux network stack. Instead of processing each packet in a separate hardware interrupt, NAPI coalesces interrupt-driven notifications into batched software polls, dramatically reducing per-packet interrupt overhead at high packet rates. Operating model:
  1. A hardware interrupt fires indicating new packets are available.
  2. The driver masks the interrupt and calls napi_schedule().
  3. The kernel invokes the driver’s poll method in softirq context.
  4. The poll method processes up to budget packets; when it drains the ring before exhausting the budget, it calls napi_complete_done() and then re-enables the device interrupt.

napi_struct

struct napi_struct {
    struct list_head    poll_list;  /* internal: queued on softirq poll list */
    unsigned long       state;      /* NAPI state bits */
    int                 weight;     /* max packets per poll (budget hint) */
    int (*poll)(struct napi_struct *, int); /* driver poll callback */
    /* ... additional internal fields ... */
};
Drivers embed one napi_struct per hardware receive queue in their private data structure.

Control API

1. Add a NAPI instance

void netif_napi_add(struct net_device *dev,
                    struct napi_struct *napi,
                    int (*poll)(struct napi_struct *, int),
                    int weight);
Registers a new NAPI instance for the given net_device. The poll callback is called with a budget (usually 64 for the default weight) indicating the maximum number of Rx packets to process per invocation. Newly added instances start in the disabled state.
dev
struct net_device *
required
The network device this NAPI instance belongs to. The instance is automatically removed when dev is unregistered.
napi
struct napi_struct *
required
Pointer to the napi_struct embedded in the driver’s per-queue private data.
poll
function pointer
required
The driver’s poll callback. Must process at most budget receive packets, call napi_complete_done() when done, and return the actual number of packets processed.
weight
int
required
Desired budget. Typically NAPI_POLL_WEIGHT (64). Use lower values for slow devices. Must not exceed NAPI_POLL_WEIGHT.
2. Enable before opening the device

void napi_enable(struct napi_struct *napi);
Transitions the NAPI instance from disabled to enabled. Call this in ndo_open() before enabling hardware interrupts. A disabled NAPI instance cannot be scheduled — any call to napi_schedule() while disabled is a no-op.
3. Schedule from the interrupt handler

void napi_schedule(struct napi_struct *napi);
bool napi_schedule_prep(struct napi_struct *napi);
void __napi_schedule(struct napi_struct *napi);
Call napi_schedule() from the hardware interrupt handler to queue the NAPI instance for polling. The pattern is:
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_priv *priv = dev_id;

    /* Mask the interrupt; NAPI poll will re-enable it */
    my_hw_disable_irq(priv);
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}
4. Poll callback

static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_priv *priv = container_of(napi, struct my_priv, napi);
    int work_done = 0;

    while (work_done < budget && my_hw_has_rx_packet(priv)) {
        struct sk_buff *skb = my_hw_receive_packet(priv);
        napi_gro_receive(napi, skb); /* or netif_receive_skb(skb) */
        work_done++;
    }

    if (work_done < budget) {
        /* All pending packets processed: complete and re-enable irq */
        napi_complete_done(napi, work_done);
        my_hw_enable_irq(priv);
    }
    return work_done;
}
If exactly budget packets were processed, do not call napi_complete_done(). The kernel will call poll again immediately. This handles the case where the ring refilled during processing.
5. Disable before closing the device

void napi_disable(struct napi_struct *napi);
Disables the NAPI instance and waits for any in-progress poll to complete. Call in ndo_stop() before freeing DMA rings. napi_disable() is not idempotent — calling it twice in a row deadlocks.

Packet Reception Path

The path from hardware interrupt to socket receive buffer:
/* In the NAPI poll callback, after building the skb: */

/* Option 1: direct delivery (no GRO) */
int netif_receive_skb(struct sk_buff *skb);

/* Option 2: via GRO (Generic Receive Offload) — preferred for Ethernet */
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
netif_receive_skb() delivers the skb to the protocol handlers registered for skb->protocol. It runs through TC ingress, XDP (if loaded), netfilter hooks, and then dispatches to the appropriate L3 handler (e.g., ip_rcv() for IPv4). napi_gro_receive() first passes the skb through GRO, which may merge it with other in-flight skbs of the same flow into a single larger skb before delivering to the stack. This reduces per-packet overhead for TCP workloads.
Before calling either function, set skb->protocol (e.g., with eth_type_trans()), skb->dev, and configure skb->ip_summed if the device performed checksum offload.
/* Typical Ethernet Rx path in a NAPI poll: */
skb->protocol = eth_type_trans(skb, dev); /* strips ETH header, sets protocol */
skb->ip_summed = CHECKSUM_UNNECESSARY;    /* if HW verified the checksum */
napi_gro_receive(napi, skb);

Packet Transmission Path

Packets enter the transmit path through the socket layer and travel down to the driver’s ndo_start_xmit:
/*
 * The kernel calls ndo_start_xmit() with the skb to transmit.
 * The driver must DMA-map, enqueue to HW ring, and return NETDEV_TX_OK.
 * If the ring is full, stop the queue and return NETDEV_TX_BUSY.
 */
static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct my_priv *priv = netdev_priv(dev);

    if (my_ring_full(priv)) {
        netif_stop_queue(dev);
        return NETDEV_TX_BUSY;
    }

    dma_addr_t dma = dma_map_single(&priv->pdev->dev,
                                    skb->data, skb->len, DMA_TO_DEVICE);
    if (dma_mapping_error(&priv->pdev->dev, dma)) {
        dev_kfree_skb_any(skb);  /* drop: DMA mapping failed */
        return NETDEV_TX_OK;
    }
    my_hw_enqueue(priv, dma, skb->len, skb);
    return NETDEV_TX_OK;
}

/* In the Tx completion interrupt / NAPI poll: */
dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
dev_consume_skb_any(skb);  /* packet was successfully sent */
