Cache hierarchy
CPU caches are small, fast memory buffers that reduce the latency of accessing data from main RAM. Modern processors use multiple levels of cache with different speed and capacity tradeoffs.
L1, L2, and L3 caches
L1 cache
Fastest and smallest
- Typically 32-64 KB per core
- 1-4 cycle latency
- Split into instruction and data caches
- Closest to CPU cores
L2 cache
Middle tier
- Typically 256 KB - 1 MB per core
- 10-20 cycle latency
- Unified (instructions + data)
- Private to each core
L3 cache
Largest and slowest
- Typically 8-64 MB shared
- 40-75 cycle latency
- Shared across all cores
- Last stop before RAM
For comparison, main RAM typically has 100-300 cycle latency. A cache hit at L1 is ~100x faster than going to RAM.
Cache access flow
When the CPU needs data, it searches through the cache hierarchy:
L1 cache check
First check the L1 cache (separate instruction and data caches). If found (cache hit), return data immediately with 1-4 cycle latency.
L2 cache check
On L1 miss, check the L2 cache. If found, return data with ~10-20 cycle latency and backfill L1.
L3 cache check
On L2 miss, check the shared L3 cache. If found, return data with ~40-75 cycle latency and backfill L2 and L1.
Cache lines
Caches don’t store individual bytes — they work with fixed-size blocks called cache lines.
Cache line fundamentals
A cache line is the minimum unit of data transfer between cache and memory.
- Typical size: 64 bytes on modern x86/ARM processors
- Spatial locality: accessing one byte loads the entire 64-byte line
- Alignment: cache lines are aligned to 64-byte boundaries in memory
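Because lines are 64 bytes and 64-byte aligned, a byte address maps to its line by simple integer division. A minimal sketch in C (the `LINE` constant and the helper names are illustrative, assuming the typical 64-byte line):

```c
#include <stdint.h>

#define LINE 64  /* assumed cache line size in bytes */

/* Index of the 64-byte line an address falls in. Two addresses with
 * the same line index are always fetched from memory together. */
uint64_t line_index(uint64_t addr)  { return addr / LINE; }

/* Offset of the address within its line (0..63). */
uint64_t line_offset(uint64_t addr) { return addr % LINE; }
```

Addresses 0 and 63 share a line; address 64 starts the next one.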
False sharing
When threads on different cores write to different variables that happen to occupy the same cache line, the coherence protocol bounces that line between cores even though no data is logically shared. Keeping each thread's hot data on its own cache line avoids the penalty.
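The usual fix is to pad or align per-thread data so each item occupies its own line. A sketch in C11 (`padded_counter` is an illustrative name; `alignas(64)` assumes a 64-byte line):

```c
#include <stdalign.h>

/* Two counters updated by different threads. Packed back to back they
 * would share one 64-byte line, so every write by one core would
 * invalidate the other core's cached copy. alignas(64) forces each
 * counter onto its own line. */
struct padded_counter {
    alignas(64) long value;  /* padded out to 64 bytes by the alignment */
};

struct padded_counter counters[2];
```

Note the tradeoff: each counter now consumes a full 64 bytes, so this is worth doing only for genuinely contended data.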
Cache optimization techniques
Writing cache-friendly code significantly improves performance by maximizing cache hit rates.
Data layout optimization
Array of Structures (AoS) vs Structure of Arrays (SoA)
Array of Structures (AoS) is the traditional object-oriented layout: each record's fields are stored together. Structure of Arrays (SoA) is the cache-optimized alternative: each field lives in its own contiguous array, so loops that touch only a few fields use every byte of every line they load.
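A sketch of both layouts in C (the `particle` fields and `N_PARTICLES` are illustrative):

```c
#include <stddef.h>

#define N_PARTICLES 1024

/* AoS: each particle's fields sit together. A loop that only reads x
 * still drags y, z, and mass into cache with every line it loads. */
struct particle { float x, y, z, mass; };
struct particle aos[N_PARTICLES];

/* SoA: each field is contiguous. A loop over x is a pure sequential
 * sweep that uses all 64 bytes of each line (and vectorizes well). */
struct particles_soa {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
    float mass[N_PARTICLES];
};
struct particles_soa soa;

/* Sum of x coordinates over the SoA layout. */
float sum_x_soa(const struct particles_soa *p) {
    float s = 0.0f;
    for (size_t i = 0; i < N_PARTICLES; i++)
        s += p->x[i];
    return s;
}
```

With AoS, the same loop would touch 16 bytes out of every 16-byte record but only 4 of them usefully; SoA makes the x sweep four times denser in cache.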
Prefetching
Give the CPU hints about future memory accesses. Modern CPUs have automatic hardware prefetchers, but manual prefetching can help with irregular access patterns.
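As one sketch, a gather through an index array defeats the hardware prefetcher (the stride is unpredictable), so we can hint the address needed a few iterations ahead. `__builtin_prefetch` is a GCC/Clang extension; the lookahead distance of 8 here is an illustrative guess that would need tuning:

```c
#include <stddef.h>

/* Sum elements gathered through an index array. The data accesses are
 * irregular, so the hardware prefetcher cannot predict them; we issue
 * a software prefetch for the element needed 8 iterations from now. */
long gather_sum(const long *data, const size_t *idx, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            /* args: address, 0 = prefetch for read, 1 = low temporal locality */
            __builtin_prefetch(&data[idx[i + 8]], 0, 1);
        s += data[idx[i]];
    }
    return s;
}
```

The prefetch is only a hint: it never faults, and the code computes the same result with or without it.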
Loop blocking (tiling)
Break large datasets into cache-sized chunks so each chunk stays resident in cache while it is being reused.
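The classic case is matrix multiplication: the naive loop streams entire rows of B through the cache for every output element, while a blocked version reuses small sub-blocks before evicting them. A sketch in C (the `TILE` size of 64 is an illustrative starting point, not a tuned value):

```c
#include <stddef.h>

#define TILE 64  /* block edge; pick so three TILE x TILE blocks fit in cache */

/* Blocked matrix multiply, C += A * B, all n x n row-major.
 * Each TILE x TILE sub-block of A, B, and C is reused many times
 * while it is still resident in cache. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The arithmetic is identical to the naive triple loop; only the traversal order changes.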
Data alignment
Align frequently accessed data to cache line boundaries; this prevents hot data from spanning multiple cache lines.
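In C11 this is a one-keyword change; a sketch (the `hot_entry` fields are illustrative, and `alignas(64)` assumes a 64-byte line):

```c
#include <stdalign.h>
#include <stdint.h>

/* A hot lookup-table entry aligned to a cache-line boundary, so the
 * whole entry is always fetched with a single 64-byte line instead of
 * possibly straddling two. The compiler pads the struct to 64 bytes. */
struct hot_entry {
    alignas(64) uint64_t key;
    uint64_t value;
    /* 48 bytes of implicit tail padding follow */
};
```

For heap allocations the C11 counterpart is `aligned_alloc(64, size)`.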
Access pattern optimization
Best case: Access memory sequentially
- Maximizes cache line utilization (use all 64 bytes)
- Enables hardware prefetching
- Predictable access pattern
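The difference is visible in something as simple as summing a 2D array: traversing row-major order visits consecutive addresses, while traversing column-major order jumps a full row (4 KB here) per access and wastes most of every line. A sketch in C:

```c
#include <stddef.h>

#define N 1024

/* Sequential (row-major) traversal: consecutive ints share cache
 * lines, so each 64-byte line fetched serves 16 accesses. */
long sum_row_major(int a[N][N]) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Strided (column-major) traversal: each access lands on a different
 * line 4 KB away, so almost every access is a fresh cache miss. */
long sum_col_major(int a[N][N]) {
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Both functions compute the same sum; only the memory access order, and hence the hit rate, differs.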
Working set size
Keep your program’s working set (frequently accessed data) smaller than cache capacity:
- L1-sized (32-64 KB): Ultra-hot paths, critical loops
- L2-sized (256 KB - 1 MB): Hot data structures
- L3-sized (8-64 MB): Warm data, larger working sets
- Beyond L3: Cold data, cache misses inevitable
Performance impact
Cache optimization can provide dramatic performance improvements:
| Optimization | Typical Speedup |
|---|---|
| Sequential vs random array access | 10-50x |
| SoA vs AoS for SIMD workloads | 2-4x |
| Eliminating false sharing | 5-20x |
| Loop blocking for matrix operations | 2-10x |
| Proper data alignment | 1.2-2x |
The exact improvements depend on your specific workload, data sizes, and hardware. Always measure before and after optimization.