The Linux memory management (MM) subsystem handles the entire lifecycle of memory in the system: from discovering physical RAM at boot time, through allocating and mapping pages for processes and the kernel, to reclaiming pages under pressure and swapping them to disk. The source lives in mm/ with architecture-specific page table code in each arch/*/mm/ directory.

Physical memory model

Linux abstracts the diversity of physical memory layouts using one of two memory models selected at build time: FLATMEM and SPARSEMEM. Both track physical page frames using struct page objects arranged in arrays, maintaining a one-to-one mapping between a Page Frame Number (PFN) and its struct page.
FLATMEM is the simplest model, suited for non-NUMA systems with contiguous physical memory. A single global mem_map array covers the entire physical address space.
/* PFN to struct page conversion under FLATMEM */
#define __pfn_to_page(pfn)  (mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
The ARCH_PFN_OFFSET accounts for systems whose physical memory starts at an address other than 0. Architecture setup code calls free_area_init() to allocate the array, which becomes usable after memblock_free_all() hands memory to the page allocator.
SPARSEMEM divides physical memory into fixed-size sections, each with its own struct page array, which supports memory hotplug and large holes in the physical address space. With SPARSEMEM_VMEMMAP, the per-section arrays are mapped into one virtually contiguous vmemmap region, so PFN conversion is again simple pointer arithmetic. ZONE_DEVICE builds on SPARSEMEM_VMEMMAP to provide struct page services for device-owned memory (persistent memory via DAX, GPU memory via HMM, peer-to-peer DMA via p2pdma) without ever marking those pages online.

Memory zones

The kernel partitions physical memory into zones that reflect hardware constraints on which addresses certain operations can use.
Zone          Purpose
ZONE_DMA      Memory accessible by legacy ISA DMA (typically the first 16 MB on x86)
ZONE_DMA32    Memory below 4 GB, required by 32-bit-only DMA devices
ZONE_NORMAL   Directly mapped kernel memory; the workhorse zone
ZONE_HIGHMEM  Physical memory above the kernel's direct mapping limit (32-bit only)
ZONE_MOVABLE  Pages that can be migrated, enabling memory hot-remove
ZONE_DEVICE   Device-managed memory (pmem, GPU)
Each zone maintains its own free-page lists and tracks statistics such as NR_FREE_PAGES and NR_INACTIVE_ANON.

Buddy allocator

The buddy allocator (mm/page_alloc.c) is the primary physical page allocator. It manages free memory in power-of-two blocks called orders (order 0 = 1 page, order 1 = 2 pages, …, up to order 10 = 1024 pages, i.e. 4 MB with 4 KB pages, on most configurations).
/* Allocate 2^order contiguous pages */
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);

/* Free pages back to the buddy system */
void __free_pages(struct page *page, unsigned int order);
When a block of order N is freed, the allocator checks whether its buddy (the adjacent block of the same size) is also free. If so, they are merged into an order N+1 block. This coalescing continues until no further merging is possible, keeping external fragmentation low.
Free lists per zone (simplified):
  order 0:  [page] [page] ...
  order 1:  [2-page block] ...
  order 2:  [4-page block] ...
  ...
  order 10: [1024-page block] ...
Use cat /proc/buddyinfo to inspect the current free-list state for each zone and NUMA node.

SLUB allocator

The SLUB allocator (mm/slub.c) sits on top of the buddy allocator and provides efficient allocation of small, fixed-size kernel objects (e.g., struct task_struct, struct inode, network buffers). It replaced the original SLAB allocator as the default. From the source (mm/slub.c):
/*
 * SLUB: A slab allocator with low overhead percpu array caches and mostly
 * lockless freeing of objects to slabs in the slowpath.
 *
 * The allocator synchronizes using spin_trylock for percpu arrays in the
 * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing.
 * Uses a centralized lock to manage a pool of partial slabs.
 *
 * (C) 2007 SGI, Christoph Lameter
 */
SLUB organises memory into caches, each serving objects of one specific size and type. Each cache keeps per-CPU slabs for lockless fast-path allocation and per-node partial-slab lists for the slow path.
/* Create a new object cache */
struct kmem_cache *kmem_cache_create(
    const char *name,
    unsigned int size,
    unsigned int align,
    slab_flags_t flags,
    void (*ctor)(void *));

/* Allocate one object */
void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags);

/* Free an object */
void kmem_cache_free(struct kmem_cache *s, void *x);
The SLUB lock order from the source:
0. cpu_hotplug_lock
1. slab_mutex          (global mutex, protects slab list)
2a. kmem_cache->cpu_sheaves->lock  (local trylock)
2b. node->barn->lock               (spinlock)
2c. node->list_lock                (spinlock)
3. slab_lock(slab)    (bit spinlock, arches without cmpxchg_double)
4. object_map_lock    (debug only)

Virtual memory and page tables

Each process has a private virtual address space described by struct mm_struct. Within that space, contiguous regions are represented as Virtual Memory Areas (struct vm_area_struct, or VMA). VMAs record permissions, backing (anonymous, file-mapped, or device), and the associated file/offset if any. Page tables translate virtual addresses to physical ones. Linux uses a multi-level page table hierarchy whose depth varies by architecture:
Architecture       Levels                         Depth
x86-64 (4-level)   PGD → P4D → PUD → PMD → PTE    4 (P4D folded away)
x86-64 (5-level)   PGD → P4D → PUD → PMD → PTE    5
AArch64            PGD → PUD → PMD → PTE          3–4 (depends on page size and VA bits)
RISC-V (Sv48)      PGD → PUD → PMD → PTE          4
The TLB caches recent translations. On context switch, the kernel either flushes the TLB or, when PCID (process-context identifiers) are available on x86-64, tags TLB entries to avoid flushing.

vmalloc

vmalloc() allocates virtually contiguous but physically discontiguous memory. It is used when large contiguous physical allocations would fail due to fragmentation, but a logically contiguous virtual range is needed (e.g., for module text, large kernel buffers).
#include <linux/vmalloc.h>

void *vmalloc(unsigned long size);
void  vfree(const void *addr);

/* When physically contiguous, cache-coherent memory is needed
 * (e.g. for DMA), use the DMA API instead: */
void *dma_alloc_coherent(struct device *dev, size_t size,
                          dma_addr_t *dma_handle, gfp_t flag);
vmalloc memory requires TLB entries for every page in the range and incurs higher access latency than kmalloc. Prefer kmalloc/kmem_cache_alloc for small, frequently allocated objects.

NUMA support

On Non-Uniform Memory Access (NUMA) systems, memory latency depends on the distance between a CPU and the memory bank being accessed. Linux models topology as a graph of nodes, each with its own zones and free lists. The NUMA-aware page allocator tries to satisfy allocations from the requesting CPU’s local node first. alloc_pages_node() lets callers specify a target node explicitly:
/* Allocate from a specific NUMA node */
struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order);
CONFIG_NUMA_BALANCING enables automatic NUMA balancing: the kernel periodically unmaps process pages, detects which NUMA node the process is actually accessing them from, and migrates pages to the closer node.
# Inspect NUMA topology
numactl --hardware

# Bind a process to a specific node
numactl --membind=0 --cpunodebind=0 ./myprogram

Huge pages and Transparent Huge Pages

Standard pages are 4 KB on most architectures. Huge pages (2 MB via a PMD-level mapping, or 1 GB via a PUD-level mapping on x86-64) reduce TLB pressure significantly for large working sets by replacing 512 page table entries with a single PMD-level mapping.
Static huge pages are reserved at boot or via sysctl; processes must explicitly request them with mmap(MAP_HUGETLB) or by mapping files on a hugetlbfs mount.
# Reserve 512 × 2 MB huge pages
echo 512 > /proc/sys/vm/nr_hugepages

# Use in a program
void *p = mmap(NULL, 2 << 20, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);

Memory reclaim and swapping

When free memory falls below watermarks, the kernel reclaims pages through the page reclaim path (mm/vmscan.c):
1. kswapd wakes up. Per-node kswapd kernel threads wake when free pages drop below the low watermark. They scan the LRU lists to find candidate pages to reclaim.
2. Page aging via LRU lists. Pages move between inactive and active LRU lists. Frequently accessed pages are promoted to the active list; cold pages age toward the inactive list and become reclaim candidates.
3. Anonymous pages are swapped. Anonymous pages (heap, stack, anonymous mmap) with no file backing must be written to the swap area before they can be freed. The swap subsystem (mm/swap_state.c, mm/swapfile.c) manages swap space on a block device or in a file.
4. File-backed pages are dropped or written back. Clean file-backed pages can simply be dropped (and re-read from disk on next access). Dirty file pages are written back by writeback threads before being freed.

OOM killer

When all reclaim efforts fail and the system has no free memory, the Out-Of-Memory (OOM) killer (mm/oom_kill.c) selects a process to kill, freeing its memory. The OOM killer scores each process based on its memory footprint (oom_score) and adjusts using a per-process tunable:
# View a process's OOM score
cat /proc/<pid>/oom_score

# Adjust OOM priority (-1000 = never kill, +1000 = kill first)
echo 500 > /proc/<pid>/oom_score_adj
Setting oom_score_adj to -1000 exempts a process from OOM killing. Use this sparingly: if too much memory is held by exempt processes, continued pressure forces the kernel to kill other, possibly more important, processes, or to panic when no killable candidate remains.
