The Linux memory management (MM) subsystem handles the entire lifecycle of memory in the system: from discovering physical RAM at boot time, through allocating and mapping pages for processes and the kernel, to reclaiming pages under pressure and swapping them to disk. The source lives in mm/ with architecture-specific page table code in each arch/*/mm/ directory.

Physical memory model

Linux abstracts the diversity of physical memory layouts using one of two memory models selected at build time: FLATMEM and SPARSEMEM. Both track physical page frames using struct page objects arranged in arrays, maintaining a one-to-one mapping between a Page Frame Number (PFN) and its struct page.
FLATMEM is the simplest model, suited for non-NUMA systems with contiguous physical memory. A single global mem_map array covers the entire physical address space.
/* PFN to struct page conversion under FLATMEM */
#define __pfn_to_page(pfn)  (mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
The ARCH_PFN_OFFSET accounts for systems whose physical memory starts at an address other than 0. Architecture setup code calls free_area_init() to allocate the array, which becomes usable after memblock_free_all() hands memory to the page allocator.
SPARSEMEM divides physical memory into fixed-size sections, each with its own struct page array, which supports memory hotplug and large holes in the physical address space. With SPARSEMEM_VMEMMAP, the per-section arrays are mapped into one virtually contiguous vmemmap region, so PFN conversion is again simple pointer arithmetic. ZONE_DEVICE builds on SPARSEMEM_VMEMMAP to provide struct page services for device-owned memory (persistent memory via DAX, GPU memory via HMM, peer-to-peer DMA via p2pdma) without ever marking those pages online.

Memory zones

The kernel partitions physical memory into zones that reflect hardware constraints on which addresses certain operations can use.
Zone          Purpose
ZONE_DMA      Memory accessible by legacy ISA DMA (typically the first 16 MB on x86)
ZONE_DMA32    Memory below 4 GB, required by 32-bit-only DMA devices
ZONE_NORMAL   Directly mapped kernel memory; the workhorse zone
ZONE_HIGHMEM  Physical memory above the kernel's direct mapping limit (32-bit only)
ZONE_MOVABLE  Pages that can be migrated, enabling memory hot-remove
ZONE_DEVICE   Device-managed memory (pmem, GPU)
Each zone maintains its own free-page lists and tracks statistics such as NR_FREE_PAGES and NR_INACTIVE_ANON.

Buddy allocator

The buddy allocator (mm/page_alloc.c) is the primary physical page allocator. It manages free memory in power-of-two blocks called orders (order 0 = 1 page, order 1 = 2 pages, …, up to order 10 = 1024 pages, i.e. 4 MB with 4 KB pages, on most configurations).
/* Allocate 2^order contiguous pages */
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);

/* Free pages back to the buddy system */
void __free_pages(struct page *page, unsigned int order);
When a block of order N is freed, the allocator checks whether its buddy (the adjacent block of the same size) is also free. If so, they are merged into an order N+1 block. This coalescing continues until no further merging is possible, keeping external fragmentation low.
Free lists per zone (simplified):
  order 0:  [page] [page] ...
  order 1:  [2-page block] ...
  order 2:  [4-page block] ...
  ...
  order 10: [1024-page block] ...
Use cat /proc/buddyinfo to inspect the current free-list state for each zone and NUMA node.

SLUB allocator

The SLUB allocator (mm/slub.c) sits on top of the buddy allocator and provides efficient allocation of small, fixed-size kernel objects (e.g., struct task_struct, struct inode, network buffers). It replaced the original SLAB allocator as the default. From the source (mm/slub.c):
/*
 * SLUB: A slab allocator with low overhead percpu array caches and mostly
 * lockless freeing of objects to slabs in the slowpath.
 *
 * The allocator synchronizes using spin_trylock for percpu arrays in the
 * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing.
 * Uses a centralized lock to manage a pool of partial slabs.
 *
 * (C) 2007 SGI, Christoph Lameter
 */
SLUB organises memory into caches, each serving objects of one specific size and type. Each cache keeps per-CPU slabs for lockless fast-path allocation and per-node partial-slab lists for the slow path.
/* Create a new object cache */
struct kmem_cache *kmem_cache_create(
    const char *name,
    unsigned int size,
    unsigned int align,
    slab_flags_t flags,
    void (*ctor)(void *));

/* Allocate one object */
void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags);

/* Free an object */
void kmem_cache_free(struct kmem_cache *s, void *x);
The SLUB lock order from the source:
0. cpu_hotplug_lock
1. slab_mutex          (global mutex, protects slab list)
2a. kmem_cache->cpu_sheaves->lock  (local trylock)
2b. node->barn->lock               (spinlock)
2c. node->list_lock                (spinlock)
3. slab_lock(slab)    (bit spinlock, arches without cmpxchg_double)
4. object_map_lock    (debug only)

Virtual memory and page tables

Each process has a private virtual address space described by struct mm_struct. Within that space, contiguous regions are represented as Virtual Memory Areas (struct vm_area_struct, or VMA). VMAs record permissions, backing (anonymous, file-mapped, or device), and the associated file/offset if any. Page tables translate virtual addresses to physical ones. Linux uses a multi-level page table hierarchy whose depth varies by architecture:
Architecture       Levels                         Depth
x86-64 (4-level)   PGD → P4D → PUD → PMD → PTE    4 (P4D folded away)
x86-64 (5-level)   PGD → P4D → PUD → PMD → PTE    5
AArch64            PGD → PUD → PMD → PTE          3–4 (depends on page size and VA bits)
RISC-V (Sv48)      PGD → PUD → PMD → PTE          4
The TLB caches recent translations. On context switch, the kernel either flushes the TLB or, when PCID (process-context identifiers) are available on x86-64, tags TLB entries to avoid flushing.

vmalloc

vmalloc() allocates virtually contiguous but physically discontiguous memory. It is used when large contiguous physical allocations would fail due to fragmentation, but a logically contiguous virtual range is needed (e.g., for module text, large kernel buffers).
#include <linux/vmalloc.h>

void *vmalloc(unsigned long size);
void  vfree(const void *addr);

/* When physically contiguous, cache-coherent memory is needed
 * (e.g. for DMA), use the DMA API instead: */
void *dma_alloc_coherent(struct device *dev, size_t size,
                          dma_addr_t *dma_handle, gfp_t flag);
vmalloc memory requires TLB entries for every page in the range and incurs higher access latency than kmalloc. Prefer kmalloc/kmem_cache_alloc for small, frequently allocated objects.

NUMA support

On Non-Uniform Memory Access (NUMA) systems, memory latency depends on the distance between a CPU and the memory bank being accessed. Linux models topology as a graph of nodes, each with its own zones and free lists. The NUMA-aware page allocator tries to satisfy allocations from the requesting CPU’s local node first. alloc_pages_node() lets callers specify a target node explicitly:
/* Allocate from a specific NUMA node */
struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order);
CONFIG_NUMA_BALANCING enables automatic NUMA balancing: the kernel periodically unmaps process pages, detects which NUMA node the process is actually accessing them from, and migrates pages to the closer node.
# Inspect NUMA topology
numactl --hardware

# Bind a process to a specific node
numactl --membind=0 --cpunodebind=0 ./myprogram

Huge pages and Transparent Huge Pages

Standard pages are 4 KB on most architectures. Huge pages (2 MB via a PMD-level mapping, or 1 GB via a PUD-level mapping on x86-64) reduce TLB pressure significantly for large working sets by replacing 512 page table entries with a single PMD-level mapping.
Static huge pages are reserved at boot or via sysctl; processes must explicitly request them with mmap(MAP_HUGETLB) or by mapping files on a hugetlbfs mount.
# Reserve 512 × 2 MB huge pages
echo 512 > /proc/sys/vm/nr_hugepages

# Use in a program
void *p = mmap(NULL, 2 << 20, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);

Memory reclaim and swapping

When free memory falls below watermarks, the kernel reclaims pages through the page reclaim path (mm/vmscan.c):
1. kswapd wakes up. Per-node kswapd kernel threads wake when free pages drop below the low watermark. They scan the LRU lists to find candidate pages to reclaim.
2. Page aging via LRU lists. Pages move between inactive and active LRU lists. Frequently accessed pages are promoted to the active list; cold pages age toward the inactive list and become reclaim candidates.
3. Anonymous pages are swapped. Anonymous pages (heap, stack, anonymous mmap) with no file backing must be written to the swap area before they can be freed. The swap subsystem (mm/swap_state.c, mm/swapfile.c) manages swap space on a block device or in a file.
4. File-backed pages are dropped or written back. Clean file-backed pages can simply be dropped (and re-read from disk on next access). Dirty file pages are written back by writeback threads before being freed.

OOM killer

When all reclaim efforts fail and the system has no free memory, the Out-Of-Memory (OOM) killer (mm/oom_kill.c) selects a process to kill, freeing its memory. The OOM killer scores each process based on its memory footprint (oom_score) and adjusts using a per-process tunable:
# View a process's OOM score
cat /proc/<pid>/oom_score

# Adjust OOM priority (-1000 = never kill, +1000 = kill first)
echo 500 > /proc/<pid>/oom_score_adj
Setting oom_score_adj to -1000 exempts a process from OOM killing. Use this sparingly: if too much memory is held by exempt processes, continued pressure forces the kernel to kill other, possibly more important, processes, or to panic when no killable candidate remains.
