PagedAttention is based on the paper “Efficient Memory Management for Large Language Model Serving with PagedAttention” by Kwon et al., published at SOSP 2023.
## The memory bottleneck problem

During LLM inference, the KV (key-value) cache is the primary memory bottleneck:

- Each token in a sequence requires storing attention keys and values
- For a Llama-13B model in fp16, storing KV cache for one token requires ~800KB
- A single sequence of 2048 tokens needs ~1.6GB just for KV cache
- Traditional implementations allocate contiguous memory, leading to fragmentation
- Memory must be pre-allocated for the maximum sequence length
- Internal fragmentation wastes ~60% of memory
- External fragmentation prevents memory sharing
- Batch size is severely limited
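The per-token figure above is simple arithmetic. A back-of-envelope sketch, assuming a 13B-parameter model with 40 layers and hidden size 5120 in fp16 (these dimensions are the standard 13B configuration, not values taken from vLLM itself):

```python
# Back-of-envelope KV cache sizing for a 13B-parameter model
# (40 layers, hidden size 5120, fp16). Figures are illustrative.
num_layers = 40
hidden_size = 5120        # num_heads * head_dim
bytes_per_param = 2       # fp16
kv_factor = 2             # one key and one value vector per token per layer

bytes_per_token = kv_factor * num_layers * hidden_size * bytes_per_param
print(f"per token: {bytes_per_token / 1024:.0f} KB")        # 800 KB

seq_len = 2048
seq_bytes = seq_len * bytes_per_token
print(f"2048-token sequence: {seq_bytes / 2**30:.1f} GB")   # ~1.6 GB
```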
## How PagedAttention works

PagedAttention divides the KV cache into fixed-size blocks, allowing non-contiguous storage in GPU memory.

### Block-based memory layout

The KV cache is divided into blocks, each storing keys and values for a fixed number of tokens:

- Block size - Typically 16 or 32 tokens per block
- Block structure - Each block stores `[block_size, num_heads, head_dim]` for keys and values
- Non-contiguous allocation - Blocks can be stored anywhere in GPU memory

`vllm/v1/core/block_pool.py`, `vllm/v1/core/kv_cache_utils.py:13`
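The essence of the layout is a per-sequence block table that maps logical block indices to physical block ids. A minimal sketch (class and method names are hypothetical simplifications, not vLLM's actual API):

```python
# Toy block table: maps a sequence's logical block index to a physical
# block id that can live anywhere in the (simulated) GPU block pool.
BLOCK_SIZE = 16

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

class SequenceBlocks:
    def __init__(self):
        self.block_table = []   # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self, pool):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(pool.allocate())
        self.num_tokens += 1

pool = BlockPool(num_blocks=1024)
seq = SequenceBlocks()
for _ in range(40):              # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token(pool)
print(seq.block_table)           # three physical ids, not required to be adjacent
```

Because the table is consulted on every block access, physical blocks never need to be contiguous, which is what eliminates external fragmentation.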
### Memory layout structure

The KV cache tensors are laid out so that GPU threads access memory efficiently:

- Memory coalescing - Neighboring threads read neighboring memory
- Efficient access patterns - Thread groups process data together
- CUDA optimization - Aligned with GPU warp execution

`csrc/attention/attention_kernels.cu`, documented in `docs/design/paged_attention.md:49`
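As a sketch of what such a layout looks like, the shapes below follow the tiling described in the paged attention kernel docs, where keys are packed so that each thread's vectorized 16-byte load stays coalesced; the concrete sizes here are example values, not defaults read from vLLM:

```python
import numpy as np

# Illustrative KV cache tensor shapes following the tiled layout described
# in docs/design/paged_attention.md. All sizes are example values.
num_blocks, num_kv_heads, head_size, block_size = 1024, 8, 128, 16
x = 16 // 2   # elements per 16-byte vectorized load at fp16 (2 bytes/elem)

# Keys are split along head_size into chunks of x so neighboring threads
# read neighboring memory; values keep a simpler [head_size, block_size] tile.
key_cache = np.zeros(
    (num_blocks, num_kv_heads, head_size // x, block_size, x), dtype=np.float16
)
value_cache = np.zeros(
    (num_blocks, num_kv_heads, head_size, block_size), dtype=np.float16
)
print(key_cache.shape, value_cache.shape)
```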
### Block allocation and management

The `KVCacheManager` handles dynamic block allocation.

Location in codebase: `vllm/v1/core/kv_cache_manager.py:94`, `vllm/v1/core/sched/scheduler.py:63`
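The manager's core job, stripped to its essentials, is allocate-on-demand and free-on-completion against a shared pool. A minimal sketch (a hypothetical simplification, not vLLM's actual interface):

```python
import math

# Minimal sketch of dynamic block allocation/free, loosely modeled on the
# role of KVCacheManager. Names and behavior are illustrative only.
class SimpleKVCacheManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.tables = {}    # request id -> list of physical block ids

    def allocate(self, req_id, num_tokens):
        need = math.ceil(num_tokens / self.block_size)
        if need > len(self.free_blocks):
            return False    # caller must wait or preempt another request
        self.tables[req_id] = [self.free_blocks.pop() for _ in range(need)]
        return True

    def free(self, req_id):
        # Finished requests return their blocks to the pool immediately.
        self.free_blocks.extend(self.tables.pop(req_id))

mgr = SimpleKVCacheManager(num_blocks=8)
assert mgr.allocate("a", num_tokens=100)      # needs ceil(100/16) = 7 blocks
assert not mgr.allocate("b", num_tokens=40)   # only 1 block left -> denied
mgr.free("a")
assert mgr.allocate("b", num_tokens=40)       # now succeeds
```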
## Key benefits

### 1. Near-zero memory waste

PagedAttention reduces memory waste from ~60% to less than 4%:

- Internal fragmentation - Only the last block may be partially filled
- External fragmentation - Eliminated through non-contiguous allocation
- Over-provisioning - No need to pre-allocate for max sequence length
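The difference is easy to quantify. With pre-allocation, a request reserves the maximum sequence length up front; with paging, at most one block per sequence is partially filled (the lengths below are illustrative assumptions):

```python
import math

# Illustrative fragmentation arithmetic; sequence lengths are assumptions.
max_seq_len, actual_len, block_size = 2048, 600, 16

# Traditional: reserve max_seq_len slots up front.
traditional_waste = 1 - actual_len / max_seq_len      # ~71% wasted here

# Paged: only the final block can be partially filled.
paged_reserved = math.ceil(actual_len / block_size) * block_size
paged_waste = 1 - actual_len / paged_reserved         # bounded by block_size

print(f"traditional waste: {traditional_waste:.0%}, paged waste: {paged_waste:.1%}")
```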
### 2. Memory sharing for parallel sampling

Multiple sequences can share prefix blocks:

- Parallel sampling (sampling `n > 1` completions)
- Beam search
- Speculative decoding
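Sharing works by reference-counting physical blocks: forked sequences point at the same prompt blocks and only copy a block when they need to write into it (copy-on-write). A sketch of the counting, with hypothetical names:

```python
# Sketch of prefix-block sharing via reference counting, a simplified
# picture of how parallel sampling reuses prompt blocks (names hypothetical).
ref_count = {}   # physical block id -> number of sequences referencing it

def fork(parent_blocks):
    """A sampled branch shares all of the parent's blocks (copy-on-write)."""
    for b in parent_blocks:
        ref_count[b] += 1
    return list(parent_blocks)   # same physical ids, no KV data copied

prompt_blocks = [7, 3, 12]
for b in prompt_blocks:
    ref_count[b] = 1

branches = [fork(prompt_blocks) for _ in range(4)]   # n=4 parallel samples
print(ref_count)   # each prompt block is now shared by 5 sequences
```

Only when a branch appends into a shared block does it allocate a private copy and decrement the count, so the prompt's KV cache is stored once regardless of `n`.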
### 3. Prefix caching

Common prompt prefixes automatically share blocks across different requests.

`docs/design/memory-management.md`, `vllm/v1/core/kv_cache_coordinator.py`
### 4. Higher throughput

By eliminating memory waste, PagedAttention enables:

- Larger batch sizes - More requests processed simultaneously
- Continuous batching - New requests join ongoing batches
- Better GPU utilization - Less memory waste means more compute
## Implementation details

### Block pool structure

The block pool maintains free and allocated blocks: `vllm/v1/core/block_pool.py`
### KVCacheBlocks interface

The scheduler interacts with the KV cache through the `KVCacheBlocks` abstraction:

`vllm/v1/core/kv_cache_manager.py:21`
### Attention kernel execution

The custom attention kernel (`paged_attention_kernel`) handles non-contiguous blocks.

Key concepts:

- Thread group - Small group of threads processing one query and one key token
- Warp - 32 threads processing one query token against all context keys in one block
- Thread block - Multiple warps processing one entire context sequence

Execution steps:

1. Load query token into shared memory
2. Iterate through blocks of key tokens
3. Compute QK dot products with softmax
4. Iterate through value blocks
5. Compute weighted sum and output

`csrc/attention/attention_kernels.cu`, `docs/design/paged_attention.md`
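The computation those steps perform can be written out in plain NumPy for a single query token and a single head: gather K/V through the block table, then apply standard softmax attention. This is a reference sketch of the math only; the real kernel fuses these loops in CUDA and never materializes the gathered tensors:

```python
import numpy as np

# Reference (non-CUDA) paged attention for one query token, one head.
# All sizes and the block table contents are illustrative.
np.random.seed(0)
head_dim, block_size, num_phys_blocks = 64, 16, 32
k_cache = np.random.randn(num_phys_blocks, block_size, head_dim).astype(np.float32)
v_cache = np.random.randn(num_phys_blocks, block_size, head_dim).astype(np.float32)

block_table = [5, 19, 2]   # logical order; physically scattered blocks
context_len = 40           # only 40 of the 48 mapped slots hold real tokens

# Gather logical K/V by walking the block table, then drop the padding
# in the final, partially filled block.
k = np.concatenate([k_cache[b] for b in block_table])[:context_len]
v = np.concatenate([v_cache[b] for b in block_table])[:context_len]

q = np.random.randn(head_dim).astype(np.float32)
scores = k @ q / np.sqrt(head_dim)          # QK dot products
weights = np.exp(scores - scores.max())     # numerically stable softmax
weights /= weights.sum()
out = weights @ v                           # weighted sum over values
print(out.shape)                            # (64,)
```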
### Hash-based prefix caching

vLLM uses content hashing to identify shared prefixes: `vllm/v1/core/kv_cache_utils.py`, `vllm/utils/hashing.py`
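The key idea is that each full block's hash commits to its own tokens and to the hash of the preceding block, so two requests with equal block hashes share an identical prefix. A simplified sketch of that chaining (the hash function and block size here are illustrative, not vLLM's actual choices):

```python
import hashlib

# Sketch of chained content-hash block keys: each block's key depends on its
# tokens and on the previous block's key, so equal keys imply equal prefixes.
def block_hashes(token_ids, block_size=4):
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size   # ignore partial tail
    for i in range(0, full, block_size):
        block = token_ids[i:i + block_size]
        h = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        hashes.append(h)
        parent = h.encode()
    return hashes

a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 99])        # partial last block dropped
b = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 42, 43])
print(a == b)   # True: both requests can reuse the same two cached blocks
```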
## Configuration

### Block size

Block size affects memory efficiency and performance: smaller blocks reduce internal fragmentation in the final, partially filled block, while larger blocks amortize block-table overhead and improve kernel efficiency.

### KV cache configuration

The `KVCacheConfig` controls cache behavior:

`vllm/v1/kv_cache_interface.py`
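For example, the block size can be chosen at launch time via vLLM's engine arguments (the model name below is illustrative):

```shell
# Launch with an explicit KV cache block size; --block-size is a vLLM
# engine argument, and the model name is only an example.
vllm serve meta-llama/Llama-3.1-8B-Instruct --block-size 16
```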
### Memory profiling

vLLM automatically profiles GPU memory to determine the number of available blocks: `vllm/v1/engine/core.py:120`
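Conceptually, the profiling answers one question: after weights and peak activations, how many KV blocks fit in the memory vLLM is allowed to use? A sketch of that arithmetic with entirely assumed numbers (none of these values are measured or read from vLLM):

```python
# Illustrative version of the profiling arithmetic: memory left after a
# profiling forward pass is carved into KV blocks. All numbers are assumptions.
gpu_total = 80 * 2**30             # 80 GB card
gpu_memory_utilization = 0.90      # fraction vLLM is allowed to use
weights = 26 * 2**30               # model weights (example)
peak_activations = 6 * 2**30       # peak observed during profiling (example)

per_token_kv = 800 * 1024          # bytes of KV per token (example)
block_size = 16
bytes_per_block = per_token_kv * block_size

available = gpu_total * gpu_memory_utilization - weights - peak_activations
num_gpu_blocks = int(available // bytes_per_block)
print(num_gpu_blocks)              # blocks available for the KV cache
```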
## Performance characteristics

### Memory efficiency

Compared to a naive implementation:

- Traditional: ~40% memory utilization (60% waste)
- PagedAttention: ~96% memory utilization (4% waste)
- Traditional: Can fit ~160 requests
- PagedAttention: Can fit ~400 requests (2.5x improvement)
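The request-count figures follow almost directly from the utilization figures, since batch capacity scales with usable KV memory:

```python
# How utilization translates into batch capacity (illustrative arithmetic
# using the figures quoted above).
usable_traditional = 0.40
usable_paged = 0.96
requests_traditional = 160

requests_paged = requests_traditional * usable_paged / usable_traditional
print(requests_paged)   # 2.4x more requests, in line with the ~2.5x quoted
```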
### Throughput improvements

Measured throughput gains:

- 2-4x higher throughput compared to naive implementations
- Near-linear scaling with batch size (up to memory limits)
- Enables continuous batching for maximum GPU utilization
## Advanced features

### Hybrid KV cache management

vLLM supports multiple cache types in the same deployment:

- GPU blocks - Fast access for active requests
- CPU blocks - Offloading for lower-priority requests
- Hybrid strategies - Automatic migration between GPU/CPU

`docs/design/hybrid_kv_cache_manager.md`
### KV cache events

For distributed scenarios, vLLM can publish KV cache events: `vllm/distributed/kv_events.py`
### Context parallelism integration

PagedAttention works seamlessly with context parallelism:

- Blocks are partitioned across context parallel ranks
- Each rank manages its subset of blocks
- Attention computation is distributed across ranks

`vllm/v1/core/sched/scheduler.py:149`
## Comparison with alternatives
| Approach | Memory Efficiency | Flexibility | Complexity |
|---|---|---|---|
| Contiguous allocation | Low (40%) | Low | Low |
| PagedAttention | High (96%) | High | Medium |
| FlashAttention | Medium (80%) | Medium | Low |
| PagedAttention + Flash | High (96%) | High | Medium |
vLLM integrates both PagedAttention (for memory management) and FlashAttention/FlashInfer (for kernel efficiency) to get the best of both approaches.
## Next steps
- Explore Model execution to understand how requests flow through the system
- Learn about Prefix caching for advanced memory sharing
- Review Performance optimization for tuning block size and cache settings