net/, with driver implementations in drivers/net/.
Socket buffer (sk_buff)
The socket buffer (struct sk_buff, defined in include/linux/skbuff.h) is the fundamental data structure that carries packets through the entire networking stack. A single sk_buff represents one network packet and travels from the driver’s receive ring upward through L2/L3/L4 processing, or downward from the socket layer to transmission.
Headers are added and removed by adjusting the skb’s data and tail pointers, avoiding data copies as the packet traverses layers. Space is reserved at the head for lower-layer headers during transmission (skb_reserve), and headers are pushed down into that space as the packet descends.
The sk_buff uses a reference count (skb->users). Cloning with skb_clone() creates a new sk_buff that shares the data area, while copying with skb_copy() produces a fully independent buffer.

Protocol layers
The stack is divided into discrete processing layers. Each layer hands the sk_buff to the next via a well-defined function call.
- L2 — Link layer
- L3 — Network layer
- L4 — Transport layer
The link layer handles framing and addressing on the local network segment. The kernel’s L2 entry point for received frames is netif_receive_skb() → __netif_receive_skb_core(), which dispatches based on skb->protocol (e.g., ETH_P_IP, ETH_P_IPV6, ETH_P_ARP). Ethernet bridging (net/bridge/) and VLANs operate at this layer.

Netfilter and iptables hooks
Netfilter (net/netfilter/) provides a framework of hooks at key points in the packet path. Kernel modules register callbacks at these hooks to implement firewalling, NAT, connection tracking, and packet mangling.
Rules are configured from user space through a setsockopt-based API (iptables) or a dedicated Netlink family (nftables).
Connection tracking (nf_conntrack) maintains a table of established connections, allowing stateful filtering and NAT session management.
eBPF and XDP
eBPF (extended Berkeley Packet Filter) lets user-supplied programs run safely inside the kernel at various hook points; every program is checked by the in-kernel verifier before it is allowed to execute. XDP (eXpress Data Path) attaches eBPF programs directly to a network driver’s receive path, where they run before the sk_buff is even allocated, achieving near-line-rate packet processing. An XDP program returns one of the following verdicts:
| Action | Meaning |
|---|---|
| XDP_PASS | Hand the packet to the normal network stack |
| XDP_DROP | Drop the packet immediately |
| XDP_TX | Transmit the packet back out the same interface |
| XDP_REDIRECT | Redirect to another interface or CPU queue |
| XDP_ABORTED | Drop and generate a trace event |
Network device drivers and NAPI
Network drivers use the NAPI (New API) interface (include/linux/netdevice.h) to batch packet processing and reduce interrupt overhead at high packet rates.
1. Interrupt fires. The NIC raises a hardware interrupt when packets arrive. The driver’s interrupt handler disables further NIC interrupts and schedules a NAPI poll.
2. NAPI poll loop. The kernel calls the driver’s poll() method in softirq context, allowing it to drain a batch of up to budget packets from the hardware ring buffer.
TCP/IP stack details

TCP send path
Data flows: write(2) → tcp_sendmsg() → TCP segmentation → ip_queue_xmit() → Netfilter POST_ROUTING → device queue → driver TX ring.

TCP maintains a send buffer (sk->sk_sndbuf) and a congestion window (tp->snd_cwnd); the minimum of the two limits how much data can be in flight.
TCP receive path
Packets arrive: driver → netif_receive_skb() → ip_rcv() → tcp_v4_rcv() → socket receive queue → read(2) in user space.

Out-of-order packets are held in the OOO queue (tp->out_of_order_queue) and spliced into the receive buffer once the gap in sequence numbers is filled.
TCP offloads
Modern NICs can offload TCP segmentation (TSO) and checksum computation (TX/RX csum offload), and incoming packets can be coalesced into larger ones (LRO in hardware, GRO in the kernel), dramatically reducing CPU overhead at high throughput.
Network namespaces
Network namespaces (net/core/net_namespace.c) provide isolated network stacks. Each namespace has its own interfaces, routing tables, iptables rules, and socket table. Containers (Docker, Kubernetes pods) rely heavily on network namespaces for isolation.
The initial network namespace is init_net. All network sysctl knobs under /proc/sys/net/ are per-namespace, allowing containers to have independent TCP buffer sizes, forwarding settings, and so on.