The Linux networking stack is a layered implementation of the Internet protocol suite that spans from hardware device drivers up to the socket API exposed to user-space applications. The source lives primarily in net/, with driver implementations in drivers/net/.

Socket buffer (sk_buff)

The socket buffer (struct sk_buff, defined in include/linux/skbuff.h) is the fundamental data structure that carries packets through the entire networking stack. A single sk_buff represents one network packet and travels from the driver’s receive ring upward through L2/L3/L4 processing, or downward from the socket layer to transmission.
struct sk_buff {
    /* Data pointers */
    unsigned char   *head;   /* start of allocated buffer */
    unsigned char   *data;   /* start of packet data */
    unsigned char   *tail;   /* end of packet data */
    unsigned char   *end;    /* end of allocated buffer */

    /* Packet metadata */
    __u32            len;       /* total length of packet data */
    __u32            data_len;  /* length in page frags */
    __be16           protocol;  /* L3 protocol type (network byte order) */
    __u8             pkt_type;  /* PACKET_HOST, BROADCAST, etc. */

    /* Device and socket */
    struct net_device *dev;
    struct sock       *sk;

    /* Timestamp, mark, priority ... */
};
Headers are added and removed by adjusting the data and tail pointers, avoiding data copies as the packet traverses layers. Space is reserved at the head for lower-layer headers during transmission (skb_reserve), and headers are pushed down into that space as the packet descends.
/* Reserve headroom for lower-layer headers */
skb_reserve(skb, NET_IP_ALIGN + ETH_HLEN);

/* Push an Ethernet header */
struct ethhdr *eth = (struct ethhdr *)skb_push(skb, ETH_HLEN);

/* Pull off the Ethernet header on receive */
skb_pull(skb, ETH_HLEN);
Each sk_buff carries a reference count (skb->users) and is freed only when the count drops to zero. Cloning with skb_clone() creates a new sk_buff header that shares the same data area; copying with skb_copy() produces a fully independent buffer.
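The four-pointer layout can be modeled in user space. The sketch below is a simplified analogue of the real API (buf_reserve, buf_push, buf_pull, and buf_put are hypothetical stand-ins for skb_reserve(), skb_push(), skb_pull(), and skb_put()); it only mimics the pointer arithmetic, not reference counting or paged fragments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model of the four sk_buff data pointers. */
struct pkt_buf {
    unsigned char *head, *data, *tail, *end;
};

static struct pkt_buf *buf_alloc(size_t size)
{
    struct pkt_buf *b = malloc(sizeof(*b));
    b->head = b->data = b->tail = malloc(size);
    b->end  = b->head + size;
    return b;
}

/* Like skb_reserve(): move data/tail forward to create headroom. */
static void buf_reserve(struct pkt_buf *b, size_t len)
{
    b->data += len;
    b->tail += len;
}

/* Like skb_push(): grow the packet downward into the headroom. */
static unsigned char *buf_push(struct pkt_buf *b, size_t len)
{
    b->data -= len;
    return b->data;
}

/* Like skb_pull(): strip a header by advancing data. */
static unsigned char *buf_pull(struct pkt_buf *b, size_t len)
{
    b->data += len;
    return b->data;
}

/* Like skb_put(): append payload at the tail. */
static unsigned char *buf_put(struct pkt_buf *b, size_t len)
{
    unsigned char *old = b->tail;
    b->tail += len;
    return old;
}

static size_t buf_headroom(const struct pkt_buf *b) { return b->data - b->head; }
static size_t buf_len(const struct pkt_buf *b)      { return b->tail - b->data; }
```

Running the transmit sequence against this model (reserve headroom, append payload, push a 14-byte "Ethernet" header) shows why no data is ever copied: every operation is pointer arithmetic inside one allocation.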

Protocol layers

The stack is divided into discrete processing layers. Each layer hands the sk_buff to the next via a well-defined function call.
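On receive, the L2 layer picks the next handler by matching skb->protocol against handlers registered with dev_add_pack(). The sketch below models that dispatch with a plain lookup table; the entry struct, handler names, and deliver() are hypothetical simplifications of the kernel's packet_type list and __netif_receive_skb_core().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ETH_P_IP
#define ETH_P_IP  0x0800   /* same values as linux/if_ether.h */
#endif
#ifndef ETH_P_ARP
#define ETH_P_ARP 0x0806
#endif

typedef int (*proto_handler_t)(const void *pkt, size_t len);

/* Model of the kernel's ptype table: protocol id -> handler. */
struct packet_type_entry {
    uint16_t        type;
    proto_handler_t func;
};

/* Toy handlers standing in for ip_rcv() and arp_rcv(). */
static int ipv4_rcv(const void *pkt, size_t len) { (void)pkt; (void)len; return 4; }
static int arp_rcv(const void *pkt, size_t len)  { (void)pkt; (void)len; return 1; }

static struct packet_type_entry ptype_table[] = {
    { ETH_P_IP,  ipv4_rcv },
    { ETH_P_ARP, arp_rcv  },
};

/* Dispatch on the L3 protocol id; unknown protocols are dropped. */
static int deliver(uint16_t proto, const void *pkt, size_t len)
{
    for (size_t i = 0; i < sizeof(ptype_table) / sizeof(ptype_table[0]); i++)
        if (ptype_table[i].type == proto)
            return ptype_table[i].func(pkt, len);
    return -1; /* no handler registered */
}
```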

Netfilter and iptables hooks

Netfilter (net/netfilter/) provides a framework of hooks at key points in the packet path. Kernel modules register callbacks at these hooks to implement firewalling, NAT, connection tracking, and packet mangling.
Incoming packet:
  NF_INET_PRE_ROUTING → routing decision
        ├── local delivery → NF_INET_LOCAL_IN → socket
        └── forward       → NF_INET_FORWARD → NF_INET_POST_ROUTING → TX

Outgoing packet:
  socket → NF_INET_LOCAL_OUT → NF_INET_POST_ROUTING → TX
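At each hook point the registered callbacks run in priority order until one returns a terminal verdict, much as nf_hook_slow() does. The sketch below is a userspace model of that traversal; the verdict names, run_hook(), and the sample callbacks are simplified stand-ins, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's NF_ACCEPT / NF_DROP verdicts. */
enum { VERDICT_ACCEPT = 1, VERDICT_DROP = 0 };

typedef int (*hook_fn)(void *pkt);

/* A hook point holds an ordered list of callbacks; traversal stops
 * at the first DROP, so later callbacks never see the packet. */
static int run_hook(hook_fn *hooks, size_t n, void *pkt)
{
    for (size_t i = 0; i < n; i++)
        if (hooks[i](pkt) == VERDICT_DROP)
            return VERDICT_DROP;
    return VERDICT_ACCEPT;
}

/* Example callbacks: drop packets whose first byte marks them "bad". */
static int accept_all(void *pkt)  { (void)pkt; return VERDICT_ACCEPT; }
static int drop_marked(void *pkt) { return *(char *)pkt ? VERDICT_DROP : VERDICT_ACCEPT; }
```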
iptables and nftables are user-space tools that program Netfilter rules via the kernel’s setsockopt-based API (iptables) or a dedicated Netlink family (nftables).
# Drop incoming packets from a specific address (iptables)
iptables -A INPUT -s 192.168.1.100 -j DROP

# Equivalent nftables rule
nft add rule inet filter input ip saddr 192.168.1.100 drop

# Masquerade outbound traffic (NAT)
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
Connection tracking (nf_conntrack) maintains a table of established connections, allowing stateful filtering and NAT session management.
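Conceptually, conntrack keys each flow by a tuple of addresses, ports, and protocol. The sketch below models the lookup/insert cycle with a flat table; the real nf_conntrack uses a hash table, tracks both directions of a flow, and covers ICMP ids, zones, and timeouts, none of which appear here (ct_lookup and ct_insert are hypothetical names).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified flow key: the kernel's tuple carries much more state. */
struct flow_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

#define MAX_FLOWS 64

static struct flow_tuple ct_table[MAX_FLOWS];
static int n_flows;

/* Return 1 if the tuple matches a known flow ("established"). */
static int ct_lookup(const struct flow_tuple *t)
{
    for (int i = 0; i < n_flows; i++)
        if (memcmp(&ct_table[i], t, sizeof(*t)) == 0)
            return 1;
    return 0;
}

/* Record a new flow on first sight, as conntrack does for NEW packets. */
static void ct_insert(const struct flow_tuple *t)
{
    if (n_flows < MAX_FLOWS && !ct_lookup(t))
        ct_table[n_flows++] = *t;
}
```

A stateful rule like "accept ESTABLISHED" then reduces to: accept if ct_lookup() succeeds, otherwise evaluate the NEW-packet policy.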

eBPF and XDP

eBPF (extended Berkeley Packet Filter) lets user-supplied programs run safely inside the kernel at various hook points; every program is checked by the in-kernel verifier before it is allowed to execute. XDP (eXpress Data Path) attaches eBPF programs directly to a network driver's receive path, running before the sk_buff is even allocated, which enables near-line-rate packet processing.
/* XDP program skeleton (runs in driver context) */
SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))  /* bpf_htons from bpf_endian.h */
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;

    return XDP_PASS;
}
XDP actions:

  Action         Meaning
  XDP_PASS       Hand packet to the normal network stack
  XDP_DROP       Drop the packet immediately
  XDP_TX         Transmit the packet back out the same interface
  XDP_REDIRECT   Redirect to another interface or CPU queue
  XDP_ABORTED    Drop and generate a trace event
# Load an XDP program onto an interface
ip link set dev eth0 xdp obj xdp_prog.o sec xdp

# Using bpftool
bpftool prog load xdp_prog.o /sys/fs/bpf/myprog
bpftool net attach xdp pinned /sys/fs/bpf/myprog dev eth0

Network device drivers and NAPI

Network drivers use the NAPI (New API) interface (include/linux/netdevice.h) to batch packet processing and reduce interrupt overhead at high packet rates.
1. Interrupt fires

The NIC raises a hardware interrupt when packets arrive. The driver’s interrupt handler disables further NIC interrupts and schedules a NAPI poll.
static irqreturn_t driver_interrupt(int irq, void *data)
{
    struct driver_priv *priv = data;   /* per-device state */

    disable_nic_interrupts(priv);      /* mask further RX interrupts */
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}
2. NAPI poll loop

The kernel calls the driver’s poll() method in a softirq context, allowing it to drain a batch of up to budget packets from the hardware ring buffer.
static int driver_poll(struct napi_struct *napi, int budget)
{
    int work_done = 0;
    while (work_done < budget && rx_ring_has_packets()) {
        struct sk_buff *skb = receive_one_packet();
        netif_receive_skb(skb);
        work_done++;
    }
    if (work_done < budget) {
        napi_complete_done(napi, work_done);
        enable_nic_interrupts();
    }
    return work_done;
}
3. Packet delivered to stack

netif_receive_skb() delivers the packet to registered L2 protocol handlers and Netfilter hooks, beginning the upward journey through the stack.
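The poll contract is the subtle part: returning a value equal to budget keeps the device in polling mode, while returning less signals the ring is drained and interrupts should be re-armed. The sketch below models just that contract in user space (model_poll, ring_pending, and irq_enabled are hypothetical names; the real driver touches hardware registers).

```c
#include <assert.h>

/* Model state: packets pending in the RX ring, and the IRQ mask. */
static int ring_pending;
static int irq_enabled;

/* Drain up to `budget` packets; if the ring empties first, complete
 * NAPI and re-arm the interrupt, as napi_complete_done() allows. */
static int model_poll(int budget)
{
    int work_done = 0;

    while (work_done < budget && ring_pending > 0) {
        ring_pending--;           /* "receive" one packet */
        work_done++;
    }
    if (work_done < budget)
        irq_enabled = 1;          /* ring drained: back to interrupt mode */
    return work_done;
}
```

Under sustained load the loop keeps returning budget, so the kernel repolls without taking a single interrupt; when traffic subsides, the short return flips the device back to interrupt-driven operation.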
# Tune the per-device NAPI weight (default 64 packets per poll)
sysctl -w net.core.dev_weight=128

# View NIC ring buffer and interrupt settings
ethtool -g eth0    # ring buffer sizes
ethtool -l eth0    # channel/queue counts
ethtool -c eth0    # interrupt coalescing

TCP/IP stack details

Transmit path: write(2) → tcp_sendmsg() → TCP segmentation → ip_queue_xmit() → Netfilter POST_ROUTING → device queue → driver TX ring. TCP maintains a send buffer (sk->sk_sndbuf) and a congestion window (tp->snd_cwnd); the minimum of the two limits how much data can be in flight.
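That in-flight limit can be sketched as a small calculation. This is a deliberate simplification (sendable_bytes is a hypothetical helper): the real decision in tcp_write_xmit() also honours the peer's receive window, Nagle, and pacing.

```c
#include <assert.h>
#include <stdint.h>

/* How many new bytes TCP may put on the wire right now, given a
 * congestion window in packets, the MSS, the send-buffer limit,
 * and the bytes already in flight. Simplified model. */
static uint32_t sendable_bytes(uint32_t snd_cwnd_pkts, uint32_t mss,
                               uint32_t sndbuf_bytes, uint32_t inflight_bytes)
{
    uint32_t cwnd_bytes = snd_cwnd_pkts * mss;
    uint32_t limit = cwnd_bytes < sndbuf_bytes ? cwnd_bytes : sndbuf_bytes;

    return inflight_bytes >= limit ? 0 : limit - inflight_bytes;
}
```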
Receive path: driver → netif_receive_skb() → ip_rcv() → tcp_v4_rcv() → socket receive queue → read(2) in user space. Out-of-order packets are held in the OOO queue (tp->out_of_order_queue) and spliced into the receive buffer once the gap in sequence numbers is filled.
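The OOO-queue behaviour can be modeled compactly. The sketch below is a hypothetical simplification (tcp_deliver, struct seg, and the flat array are not the kernel's structures — the real queue is an rb-tree keyed by sequence number), but the splice-on-gap-fill logic is the same idea.

```c
#include <assert.h>

#define MAX_SEGS 16

/* One buffered out-of-order segment. */
struct seg { unsigned seq; unsigned len; int used; };

static struct seg ooo[MAX_SEGS];
static unsigned rcv_nxt;   /* next in-order sequence number expected */

/* In-order data advances rcv_nxt and may splice queued segments in;
 * anything else waits in the OOO queue. */
static void tcp_deliver(unsigned seq, unsigned len)
{
    if (seq != rcv_nxt) {                 /* out of order: queue it */
        for (int i = 0; i < MAX_SEGS; i++)
            if (!ooo[i].used) {
                ooo[i] = (struct seg){ seq, len, 1 };
                return;
            }
        return;                           /* queue full: drop */
    }
    rcv_nxt += len;                       /* in-order: consume */

    /* Splice any queued segment that is now contiguous. */
    for (int i = 0; i < MAX_SEGS; i++)
        if (ooo[i].used && ooo[i].seq == rcv_nxt) {
            rcv_nxt += ooo[i].len;
            ooo[i].used = 0;
            i = -1;                       /* rescan from the start */
        }
}
```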
Modern NICs can offload TCP segmentation (TSO), checksum computation (TX/RX csum offload), and large receive coalescing (LRO/GRO) to hardware, dramatically reducing CPU overhead at high throughput.
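The core of segmentation offload is simple arithmetic: one large send becomes a train of MSS-sized segments, each getting its own replicated headers. The sketch below (tso_segments is a hypothetical helper) only counts the segments a given payload would produce.

```c
#include <assert.h>

/* Model of TSO/GSO: a large send is cut into MSS-sized segments.
 * The kernel (or NIC) replicates headers per segment; here we only
 * compute how many segments result. */
static int tso_segments(unsigned payload, unsigned mss)
{
    if (mss == 0)
        return 0;
    return (payload + mss - 1) / mss;   /* ceil(payload / mss) */
}
```

The saving is that the stack traverses its layers once for the 64 KB super-packet instead of ~44 times for individual 1460-byte segments; GRO applies the same idea in reverse on receive.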
ethtool -k eth0 | grep offload
ethtool -K eth0 tso on gro on

Network namespaces

Network namespaces (net/core/net_namespace.c) provide isolated network stacks. Each namespace has its own interfaces, routing tables, iptables rules, and socket table. Containers (Docker, Kubernetes pods) rely heavily on network namespaces for isolation.
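From user space, a process's current network namespace is visible as the /proc/self/ns/net symlink, whose target "net:[<inode>]" identifies the namespace: two processes share a netns exactly when the inodes match. A small sketch (netns_id is a hypothetical helper):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read this process's network-namespace identifier, e.g. "net:[4026531840]". */
static int netns_id(char *buf, size_t len)
{
    ssize_t n = readlink("/proc/self/ns/net", buf, len - 1);

    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Tools like `ip netns` build on the same mechanism, bind-mounting these namespace files under /run/netns so namespaces outlive their creating process.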
# Create a new network namespace
ip netns add myns

# Run a command inside the namespace
ip netns exec myns ip link show

# Move a physical interface into a namespace
ip link set eth1 netns myns

# Create a veth pair to connect two namespaces
ip link add veth0 type veth peer name veth1
ip link set veth1 netns myns
/* Kernel API — get the network namespace of a socket */
struct net *sock_net(const struct sock *sk);

/* Iterate over all network namespaces */
for_each_net(net) {
    /* process each namespace */
}
The initial network namespace is init_net. All network sysctl knobs under /proc/sys/net/ are per-namespace, allowing containers to have independent TCP buffer sizes, forwarding settings, and so on.
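Because /proc/sys/net reflects the reader's own namespace, a process can inspect its per-namespace settings with an ordinary file read. A minimal sketch (read_ip_forward is a hypothetical helper):

```c
#include <assert.h>
#include <stdio.h>

/* Read the per-namespace IPv4 forwarding knob; returns 0, 1, or -1
 * if the file is unavailable. */
static int read_ip_forward(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/ip_forward", "r");
    int v = -1;

    if (f) {
        if (fscanf(f, "%d", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}
```

Run inside a container's namespace, this returns that container's setting, independent of the host's.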
