What is KV Cache?
Key-Value (KV) cache is a critical optimization technique for autoregressive transformer models. During text generation, each token attends to all previous tokens in the sequence. Without caching, this would require recomputing attention keys and values for all previous tokens at each step. The KV cache stores these computed attention keys and values, allowing the model to:
- Compute attention only for the new token(s)
- Reuse cached keys and values from previous tokens
- Dramatically reduce computational cost from O(n²) to O(n) per token
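A minimal single-head sketch (illustrative NumPy, with made-up shapes) of why this works: the cached loop appends one key/value row per step instead of reprojecting the whole prefix, yet produces outputs identical to full recomputation.

```python
import numpy as np

def attention(q, K, V):
    # q: [d], K/V: [t, d]; single-head scaled dot-product attention
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Hypothetical per-token key/value/query projections for a 5-token sequence.
rng = np.random.default_rng(0)
d, T = 8, 5
k_proj = rng.normal(size=(T, d))
v_proj = rng.normal(size=(T, d))
q_proj = rng.normal(size=(T, d))

# Without cache: recompute K/V for the entire prefix at every step.
out_nocache = [attention(q_proj[t], k_proj[: t + 1], v_proj[: t + 1]) for t in range(T)]

# With cache: append one new K/V row per step and reuse the rest.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
out_cached = []
for t in range(T):
    K_cache = np.vstack([K_cache, k_proj[t : t + 1]])  # O(1) new K/V work per step
    V_cache = np.vstack([V_cache, v_proj[t : t + 1]])
    out_cached.append(attention(q_proj[t], K_cache, V_cache))

assert np.allclose(out_nocache, out_cached)  # identical outputs, far less recomputation
```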
Performance Impact
Without KV cache:
- First token: 100ms
- Second token: 200ms (recomputes first token)
- Third token: 300ms (recomputes first two tokens)
- Total for 100 tokens: ~500 seconds

With KV cache:
- First token: 100ms (fills cache)
- Each subsequent token: ~100ms (uses cache)
- Total for 100 tokens: ~10 seconds (50x faster!)
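These totals can be sanity-checked with a few lines of arithmetic, assuming the idealized cost model above (100ms per prefix token recomputed):

```python
per_token_ms = 100
n = 100

# Without cache, step t pays for recomputing all t prefix tokens.
no_cache_ms = sum(per_token_ms * t for t in range(1, n + 1))
# With cache, every step costs roughly the same.
with_cache_ms = per_token_ms * n

print(no_cache_ms / 1000, with_cache_ms / 1000)  # 505.0 vs 10.0 seconds, ~50x
```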
Why KV Cache Matters
For a transformer with:
- L layers
- H attention heads
- D head dimension
- S sequence length
- B batch size

the cache holds 2 × B × L × H × S × D elements (a key and a value per token, per head, per layer). Managing this footprint is central to:
- Memory efficiency
- Batch size scaling
- Long context support
- Multi-sequence generation
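A rough footprint calculator built from these dimensions (the 2× accounts for keys plus values; the example configuration is a hypothetical 7B-class model, not one taken from this document):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Keys + values (the leading 2), fp16 = 2 bytes per element by default.
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class config: 32 layers, 32 heads, head_dim 128, 4k context.
gb = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096, batch=1) / 1024**3
print(f"{gb:.1f} GiB")  # 2.0 GiB for a single fp16 sequence at 4k tokens
```

Doubling the batch size or the context length doubles this figure, which is why both are often capped by available memory.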
KV Cache Implementation
ONNX Runtime GenAI provides several KV cache implementations (from src/models/kv_cache.h):
Cache Types
DefaultKeyValueCache
Standard implementation where past and present are separate tensors.

Usage:
- Most model architectures
- Default choice for CPU and most execution providers
Implementation (from src/models/kv_cache.h:64).
CombinedKeyValueCache
Keys and values are combined in a single tensor per layer.

Usage:
- Models with past_names config (combined KV format)
- Some optimized model exports

Shape: [batch, 2, heads, seq, dim], where dimension 1 holds [keys, values].

Implementation (from src/models/kv_cache.h:31).
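The combined layout can be illustrated with a quick shape check (an illustrative sketch, not the library's code):

```python
import numpy as np

# Combined KV layout: one tensor per layer, [batch, 2, heads, seq, dim],
# where index 0 on axis 1 holds keys and index 1 holds values.
combined = np.zeros((1, 2, 4, 7, 64))
keys, values = combined[:, 0], combined[:, 1]
print(keys.shape, values.shape)  # (1, 4, 7, 64) (1, 4, 7, 64)
```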
ModelManagedKeyValueCache
KV cache is managed internally by the model/execution provider.

Usage:
- QNN execution provider
- Models with stateful sessions
- Custom execution providers with internal caching
Implementation (from src/models/kv_cache.h:130).

Cache Shape Formats
From the configuration and implementation:

Default (Separate K/V): separate key and value tensors per layer, each of shape [batch, heads, seq, dim].

Cache Management Strategies
Update Strategy
After each generation step, the present key/value tensors are appended to the cached past along the sequence dimension (from src/models/kv_cache.cpp).
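A sketch of that update for the default separate past/present layout (illustrative NumPy, not the actual src/models/kv_cache.cpp code):

```python
import numpy as np

def update_cache(past, present):
    # Append this step's keys/values to the cached prefix along the seq axis.
    # Shapes follow the document's convention: [batch, heads, seq, dim].
    return np.concatenate([past, present], axis=2)

past_k = np.zeros((1, 4, 7, 64))   # 7 cached tokens
new_k = np.ones((1, 4, 1, 64))     # 1 freshly computed token
print(update_cache(past_k, new_k).shape)  # (1, 4, 8, 64)
```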
Beam Search Reordering
Beam search requires reordering cache entries when beams are selected.

Shared Buffer Optimization
For CUDA with greedy search, enable buffer sharing to reduce memory allocations (from src/config.h:308).

Requirements (from src/generators.h:96):
- CUDA execution provider
- num_beams=1 (greedy search) OR Whisper model

Benefits:
- Allocates cache to max_length upfront
- Eliminates memory allocations during generation
- Enables CUDA graph capture
- Reduces latency per token
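As a hedged illustration, the option is set in the model's genai_config.json; the placement under "search" is assumed here, so verify it against your model's config schema:

```json
{
  "search": {
    "past_present_share_buffer": true
  }
}
```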
Memory Optimization
Per-Layer Cache Shapes
Models with alternating attention patterns (e.g., sliding window) can have different cache shapes per layer.

Cache Pruning for Long Contexts
For sequences exceeding context length, older cache entries can be pruned.

Sliding Window Attention
From src/config.h:228, models can use sliding window attention to limit cache size: the cache holds at most the last window_size tokens, regardless of total sequence length.
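A minimal sketch of the windowed update (illustrative NumPy, assuming the [batch, heads, seq, dim] layout used above):

```python
import numpy as np

def append_windowed(cache, new_kv, window_size):
    # Append the new token's K/V, then keep only the last window_size positions.
    cache = np.concatenate([cache, new_kv], axis=2)  # [batch, heads, seq, dim]
    return cache[:, :, -window_size:, :]

cache = np.empty((1, 2, 0, 4))
for _ in range(10):  # generate 10 tokens
    cache = append_windowed(cache, np.ones((1, 2, 1, 4)), window_size=4)
print(cache.shape)  # (1, 2, 4, 4): capped at window_size regardless of length
```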
Device-Specific Considerations
CUDA
Advantages:
- Fast GPU memory access
- CUDA graph capture with shared buffers
- Efficient beam search reordering with kernels
DirectML
Considerations:
- Inputs may be on CPU while cache is on GPU
- Separate device interfaces: p_device_inputs_ vs p_device_kvcache_
CPU
Optimization:
- Use fp16 or quantized models to reduce memory bandwidth
- Consider smaller batch sizes
- Cache fits in system RAM (typically larger than GPU VRAM)
WebGPU
Limitations:
- Inputs must be on CPU
- Cache on GPU
- Cross-device copies required
Continuous Decoding and Rewinding
The KV cache supports rewinding to enable speculative decoding and alternative generation paths (from src/models/kv_cache.h:76).
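Conceptually, a rewind just truncates the cached sequence back to an earlier position, e.g. after rejecting draft tokens in speculative decoding (illustrative sketch, not the library's implementation):

```python
import numpy as np

def rewind(cache, new_length):
    # Drop cached positions beyond new_length; layout [batch, heads, seq, dim].
    return cache[:, :, :new_length, :]

cache = np.zeros((1, 2, 12, 4))   # 12 tokens cached (8 accepted + 4 draft tokens)
cache = rewind(cache, 8)          # reject the 4 draft tokens
print(cache.shape)  # (1, 2, 8, 4)
```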
Cross-Attention Cache
Encoder-decoder models use a separate CrossCache for encoder outputs (from src/models/kv_cache.h:109):
- Created once during encoding
- Never updated during decoding
- Shared across all decoder steps
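A sketch of why the cross cache is computed only once (illustrative NumPy; the weight matrices are hypothetical):

```python
import numpy as np

def build_cross_cache(encoder_out, w_k, w_v):
    # Cross-attention K/V are projections of the fixed encoder output,
    # so they never change during decoding and can be reused at every step.
    return encoder_out @ w_k, encoder_out @ w_v

rng = np.random.default_rng(0)
enc = rng.normal(size=(1, 20, 16))  # [batch, src_len, hidden] encoder output
ck, cv = build_cross_cache(enc, rng.normal(size=(16, 16)), rng.normal(size=(16, 16)))
print(ck.shape, cv.shape)  # (1, 20, 16) (1, 20, 16): same tensors every decode step
```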
Performance Implications
Memory vs. Compute Trade-off
With KV Cache:
- ✅ Much faster generation (50-100x)
- ❌ Significant memory usage (1-4 GB per sequence)
- ❌ Limits batch size and context length

Without KV Cache:
- ❌ Extremely slow (quadratic in sequence length)
- ✅ Minimal memory overhead
- ❌ Not practical for generation
Batch Size Impact
Maximum batch size is often limited by KV cache memory.

Context Length Scaling
KV cache memory scales linearly with context length.

Debugging KV Cache
Enable logging to track cache operations.

Best Practices
Enable Shared Buffers
For CUDA with greedy search, enable past_present_share_buffer for best performance.

Next Steps
Generation
Learn about search strategies and generation parameters
Models
Explore model configuration and optimization
Performance Tuning
Optimize inference performance
API Reference
Browse the complete API documentation