
Cache hierarchy

CPU caches are small, fast memory buffers that reduce the latency of accessing data from main RAM. Modern processors use multiple levels of cache with different speed and capacity tradeoffs.

L1, L2, and L3 caches

L1 cache

Fastest and smallest
  • Typically 32-64 KB per core
  • 1-4 cycle latency
  • Split into instruction and data caches
  • Closest to CPU cores

L2 cache

Middle tier
  • Typically 256 KB - 1 MB per core
  • 10-20 cycle latency
  • Unified (instructions + data)
  • Private to each core

L3 cache

Largest and slowest
  • Typically 8-64 MB shared
  • 40-75 cycle latency
  • Shared across all cores
  • Last stop before RAM

For comparison, main RAM typically has 100-300 cycle latency, so an L1 hit is roughly 50-100x faster than going all the way to RAM.

Cache access flow

When the CPU needs data, it searches through the cache hierarchy:
1. L1 cache check: First check the L1 cache (separate instruction and data caches). On a hit, the data returns in 1-4 cycles.
2. L2 cache check: On an L1 miss, check the L2 cache. On a hit, the data returns in ~10-20 cycles and backfills L1.
3. L3 cache check: On an L2 miss, check the shared L3 cache. On a hit, the data returns in ~40-75 cycles and backfills L2 and L1.
4. Main memory access: On an L3 miss, fetch from main RAM in ~100-300 cycles and update all cache levels.

CPU Core

┌──────────┐  Hit: 1-4 cycles
│ L1 Cache │  32-64 KB
└──────────┘
   ↓ Miss
┌──────────┐  Hit: 10-20 cycles
│ L2 Cache │  256KB - 1MB
└──────────┘
   ↓ Miss
┌──────────┐  Hit: 40-75 cycles
│ L3 Cache │  8-64 MB (shared)
└──────────┘
   ↓ Miss
┌──────────┐  Hit: 100-300 cycles
│ Main RAM │  8-64 GB
└──────────┘

Cache lines

Caches don’t store individual bytes — they work with fixed-size blocks called cache lines.

Cache line fundamentals

A cache line is the minimum unit of data transfer between cache and memory.
  • Typical size: 64 bytes on modern x86/ARM processors
  • Spatial locality: accessing one byte loads the entire 64-byte line
  • Alignment: cache lines are aligned to 64-byte boundaries in memory
When you access a single byte, the CPU fetches the entire surrounding 64-byte block into cache, betting you’ll soon access nearby data.

False sharing

False sharing occurs when different CPU cores modify variables that happen to reside on the same cache line, causing excessive cache coherency traffic and severe performance degradation.
// BAD: False sharing
struct Counter {
    int count_thread1;  // ← These two fields sit on
    int count_thread2;  // ← the same cache line!
};

// Thread 1 modifies count_thread1
// Thread 2 modifies count_thread2
// Result: constant cache line ping-pong between cores
// GOOD: Padded to separate cache lines
struct Counter {
    alignas(64) int count_thread1;  // start the struct on a 64-byte boundary
    char padding[60];               // 4-byte int + 60 = 64: next field starts a new line
    int count_thread2;
};

// Now each counter is on its own cache line
// No false sharing!

Cache optimization techniques

Writing cache-friendly code significantly improves performance by maximizing cache hit rates.

Data layout optimization

Array of Structures (AoS) - traditional object-oriented approach:
struct Point {
    float x, y, z;
};
Point points[1000];

// Processing: loads every field even though only x is needed
for (int i = 0; i < 1000; i++) {
    sum += points[i].x;  // Loads x, y, z into cache
}
Structure of Arrays (SoA) - cache-optimized:
struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
};
Points points;

// Processing: only loads needed data
for (int i = 0; i < 1000; i++) {
    sum += points.x[i];  // Only loads x values
}
Use SoA when you frequently process one field across many objects. Use AoS when you typically need all fields together.

Prefetching

Give the CPU hints about upcoming memory accesses:
for (int i = 0; i < n; i++) {
    // Prefetch data several iterations ahead (i + 8)
    __builtin_prefetch(&array[i + 8]);
    
    // Process current data
    process(array[i]);
}
Modern CPUs have automatic hardware prefetchers, but manual prefetching can help with irregular access patterns.

Loop blocking

Break large datasets into cache-sized chunks so each block stays resident while it is processed:
// BAD: the writes to dst stride down a column, so each written
// cache line is evicted before its neighbors are touched
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        dst[j][i] = src[i][j];

// GOOD: transpose in cache-sized blocks so the lines of both
// src and dst stay resident for the whole block
for (int ii = 0; ii < N; ii += BLOCK_SIZE)
    for (int jj = 0; jj < N; jj += BLOCK_SIZE)
        for (int i = ii; i < min(ii + BLOCK_SIZE, N); i++)
            for (int j = jj; j < min(jj + BLOCK_SIZE, N); j++)
                dst[j][i] = src[i][j];

Data alignment

Align frequently accessed data to cache line boundaries:
// Align to 64-byte cache line
struct alignas(64) HotData {
    int frequently_accessed[16];
};

// Or using attributes
__attribute__((aligned(64))) int hot_array[16];
This prevents hot data from spanning multiple cache lines.

Access pattern optimization

Best case: Access memory sequentially
// GOOD: Sequential access
for (int i = 0; i < n; i++) {
    sum += array[i];  // Perfect cache line utilization
}
  • Maximizes cache line utilization (all 64 fetched bytes are used)
  • Enables hardware prefetching
  • Predictable access pattern

Working set size

Keep your program’s working set (frequently accessed data) smaller than cache capacity:
  • L1-sized (32-64 KB): Ultra-hot paths, critical loops
  • L2-sized (256 KB - 1 MB): Hot data structures
  • L3-sized (8-64 MB): Warm data, larger working sets
  • Beyond L3: Cold data, cache misses inevitable
Profile your code with cache performance counters (using perf stat -e cache-misses,cache-references on Linux) to identify cache bottlenecks.

Performance impact

Cache optimization can provide dramatic performance improvements:
Optimization                           Typical speedup
Sequential vs. random array access     10-50x
SoA vs. AoS for SIMD workloads         2-4x
Eliminating false sharing              5-20x
Loop blocking for matrix operations    2-10x
Proper data alignment                  1.2-2x
The exact improvements depend on your specific workload, data sizes, and hardware. Always measure before and after optimization.
