Overview
The KV cache is a pool of blocks that can hold KV state for a fixed number of tokens. Key features include:
- Paged memory management with block-based allocation
- Cross-request reuse via radix tree search
- Prioritized eviction with configurable retention policies
- Offloading to host memory for extended capacity
- Multiple data types (FP16, BF16, FP8, INT8, NVFP4)
- Variable attention windows and grouped query attention support
Architecture
Block-Based Memory
The KV cache divides memory into fixed-size blocks:
- Each block holds KV state for a fixed number of tokens (configurable, must be a power of 2)
- Multiple layers are packed within a single block
- Blocks are assigned to requests as needed
- Separate pools for different attention window sizes and head counts
All layers in a pool must have the same number of heads and attention window size. Multiple pools are created automatically to support models with varying configurations.
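The block accounting described above can be sketched in a few lines. This is an illustrative sketch, not TensorRT-LLM internals; the `BlockPool` name and structure are hypothetical.

```python
# Illustrative sketch: a pool of fixed-size KV blocks, each holding state
# for `tokens_per_block` tokens. BlockPool is a hypothetical name, not a
# TensorRT-LLM class.
class BlockPool:
    def __init__(self, num_blocks, tokens_per_block):
        # block size must be a power of 2, as the docs require
        assert tokens_per_block & (tokens_per_block - 1) == 0
        self.tokens_per_block = tokens_per_block
        self.free = list(range(num_blocks))  # free block ids
        self.owner = {}                      # block id -> request id

    def blocks_needed(self, num_tokens):
        # ceil-divide: a partially filled block still occupies a full block
        return -(-num_tokens // self.tokens_per_block)

    def allocate(self, request_id, num_tokens):
        n = self.blocks_needed(num_tokens)
        if n > len(self.free):
            raise MemoryError("pool exhausted")
        blocks = [self.free.pop() for _ in range(n)]
        for b in blocks:
            self.owner[b] = request_id
        return blocks

pool = BlockPool(num_blocks=16, tokens_per_block=32)
print(pool.blocks_needed(100))  # 100 tokens -> 4 blocks of 32
```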
Radix Tree for Reuse
Blocks are stored in a radix search tree as they are filled:
- New requests search the tree for matching prefixes
- Matched blocks are reused instead of recalculated
- Blocks can be shared among multiple requests
- Saves both memory and computation
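The prefix-matching behavior above can be sketched at block granularity. A flat dict stands in for the radix tree here; the names are hypothetical, not TensorRT-LLM internals.

```python
# Illustrative sketch of cross-request prefix reuse at block granularity.
BLOCK = 4  # tokens per block

cache = {}  # tuple of prefix tokens -> cached block id

def store(tokens):
    # register every block-aligned prefix of a finished request
    for i in range(BLOCK, len(tokens) + 1, BLOCK):
        cache.setdefault(tuple(tokens[:i]), len(cache))

def match_prefix(tokens):
    """Return ids of cached blocks covering the longest matched prefix."""
    matched = []
    for i in range(BLOCK, len(tokens) + 1, BLOCK):
        block = cache.get(tuple(tokens[:i]))
        if block is None:
            break  # prefix diverges; remaining blocks must be computed
        matched.append(block)
    return matched

store([1, 2, 3, 4, 5, 6, 7, 8])                # first request fills two blocks
print(match_prefix([1, 2, 3, 4, 9, 9, 9, 9]))  # first block reused, second recomputed
```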
Configuration
Memory Allocation
Control how much GPU memory is allocated to the KV cache. If both free_gpu_memory_fraction and max_tokens are set, the lesser of the two limits is used.
Data Type
Specify the KV cache data type for memory-performance tradeoffs:
- FP16/BF16: default precision, best accuracy, highest memory usage
- FP8: roughly half the memory of FP16/BF16
- INT8: 8-bit integer storage
- NVFP4: 4-bit floating point, lowest memory usage
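The memory limits and data type above are set together on the cache config. A hedged sketch, assuming the TensorRT-LLM LLM API; check your version's KvCacheConfig reference for exact field names.

```python
# Hedged sketch, assuming the TensorRT-LLM LLM API (field names may vary
# by version).
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.9,  # fraction of free GPU memory for KV cache
    max_tokens=65536,              # hard cap; the lesser of the two limits wins
    dtype="fp8",                   # FP8 KV cache for ~2x memory savings vs FP16
)

llm = LLM(model="your-model", kv_cache_config=kv_cache_config)
```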
Host Memory Offloading
Extend effective cache capacity by offloading to CPU memory.
How Offloading Works
- When a GPU block is evicted, it’s first copied to host memory
- The block remains reusable from host memory
- When reused, it’s copied back to GPU memory
- Blocks below secondary_offload_min_priority are evicted directly (not offloaded)
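The eviction path described above can be sketched as a two-tier decision: an evicted GPU block is either copied to a host pool or dropped, depending on its priority. The structures here are hypothetical, not TensorRT-LLM internals.

```python
# Illustrative sketch of the offload-on-eviction path. Names are
# hypothetical, not TensorRT-LLM internals.
secondary_offload_min_priority = 50

gpu_pool = {"blk_a": 80, "blk_b": 20}  # block id -> priority
host_pool = {}

def evict(block_id):
    priority = gpu_pool.pop(block_id)
    if priority >= secondary_offload_min_priority:
        host_pool[block_id] = priority  # offload: still reusable from host
    # else: dropped outright, no host copy

evict("blk_a")  # priority 80 -> offloaded to host
evict("blk_b")  # priority 20 -> evicted directly
print(sorted(host_pool))  # only 'blk_a' survives on the host
```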
When to Use Offloading
- Long-running sessions with high reuse potential
- Scenarios with repeating prompts or patterns
- When GPU memory is limited but CPU memory is abundant
- Trade PCIe bandwidth for additional cache capacity
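Offloading is enabled through the cache config. A hedged sketch, assuming the TensorRT-LLM KvCacheConfig fields `host_cache_size` and `secondary_offload_min_priority`; verify against your version's API reference.

```python
# Hedged sketch enabling host offloading; field names are assumptions
# based on the TensorRT-LLM LLM API and may differ across versions.
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    host_cache_size=32 * 1024**3,       # 32 GiB of host memory for offloaded blocks
    secondary_offload_min_priority=35,  # blocks below this are dropped, not offloaded
)
```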
Block Retention and Eviction
Priority-Based Eviction
Blocks are assigned priority scores (0-100, higher = more important):
- Lowest-priority blocks are evicted first
- Only leaf blocks (no descendants in radix tree) can be evicted
- Prioritized LRU (Least Recently Used) within each priority level
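The victim-selection rule above (lowest priority first, LRU within a priority level) can be sketched directly. The structures are hypothetical, for illustration only.

```python
# Illustrative sketch of priority-then-LRU eviction over leaf blocks.
leaves = [
    # (block id, priority 0-100, last-used timestamp)
    ("sys_prompt", 90, 10),
    ("user_ctx",   50, 30),
    ("scratch_a",  20, 40),
    ("scratch_b",  20, 5),
]

def pick_victim(leaf_blocks):
    # lowest priority first; ties broken by least recently used
    return min(leaf_blocks, key=lambda b: (b[1], b[2]))[0]

print(pick_victim(leaves))  # 'scratch_b': priority 20, least recently used of the ties
```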
Priority reverts to the default (35) after duration_ms elapses from when the block first becomes available for reuse.
Partial Reuse
Partial block matching enables more flexible reuse:
copy_on_partial_reuse=True (Default)
- Creates a new block and copies matched tokens
- Allows multiple requests to partially reuse the same block
- Higher memory usage but more flexible
copy_on_partial_reuse=False
- Reuses block in-place (no copy)
- Only works if no other request is using the block
- Lower memory usage but less flexible
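The tradeoff is selected on the cache config. A hedged sketch, assuming KvCacheConfig exposes `copy_on_partial_reuse` as a constructor field; confirm against your version's API reference.

```python
# Hedged sketch; `copy_on_partial_reuse` is assumed to be a KvCacheConfig
# field, matching the option named in the text above.
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    copy_on_partial_reuse=False,  # reuse partially matched blocks in place
)
```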
Security Features
KV Cache Salting
Cache salting prevents unauthorized reuse of cached KV states: a salt value is mixed into the cache key, so only requests carrying the same salt can match each other's entries.
Multimodal UUID Support
For multimodal models, custom UUIDs enable deterministic cache management.
Cache Correctness: When a UUID is provided, the cache key is computed from both the UUID and the content using BLAKE3(UUID || Content). This ensures:
- Different content always produces different cache entries
- Same content with different UUIDs produces different entries (user isolation)
- Original UUID is preserved in KV cache events for external tracking
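The key-derivation scheme above can be sketched with the standard library. The docs specify BLAKE3(UUID || Content); `hashlib.blake2b` stands in here because BLAKE3 is not in the Python standard library, and the function name is hypothetical.

```python
# Illustrative sketch of a salted/UUID-scoped cache key. blake2b is a
# stand-in for the BLAKE3 hash named in the docs.
import hashlib

def cache_key(content: bytes, salt: bytes = b"") -> str:
    # key = H(salt-or-UUID || content): same content, different salt
    # -> different cache entry (user isolation)
    return hashlib.blake2b(salt + content, digest_size=16).hexdigest()

k_tenant_a = cache_key(b"shared prompt", salt=b"tenant-a")
k_tenant_b = cache_key(b"shared prompt", salt=b"tenant-b")
print(k_tenant_a != k_tenant_b)  # True: identical content, isolated entries
```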
Advanced Features
Attention Window Size
Configure per-layer attention window sizes. If the list length is less than the number of layers, the pattern repeats. For example, [4096, 256] means layer 0 has full attention (4096), layer 1 has a sliding window (256), layer 2 has full attention, and so on.
Grouped Query Attention (GQA)
TensorRT-LLM automatically optimizes KV cache for GQA/MQA:
- MHA (Multi-Head Attention): One group per head
- MQA (Multi-Query Attention): Single group for all heads
- GQA (Grouped Query Attention): Intermediate grouping
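The window-pattern repetition described under Attention Window Size can be sketched in a few lines; `expand_windows` is a hypothetical helper for illustration.

```python
# Sketch of the repeating per-layer window pattern: a list shorter than
# the layer count cycles. `expand_windows` is a hypothetical name.
from itertools import cycle, islice

def expand_windows(pattern, num_layers):
    return list(islice(cycle(pattern), num_layers))

print(expand_windows([4096, 256], 5))  # [4096, 256, 4096, 256, 4096]
```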
Streaming and Long Context
Support for models with limited attention windows:
Cyclic KV Cache
- Treats KV cache as a circular buffer
- Stores only the last N tokens (N = attention_window_size)
- New tokens overwrite least recently used cache
- Reduces memory for very long sequences
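The circular-buffer behavior above amounts to indexing positions modulo the window size, so new tokens overwrite the oldest slots. A minimal illustration:

```python
# Illustrative sketch of a cyclic KV cache: positions wrap modulo the
# attention window, so only the last N tokens are kept.
attention_window_size = 4

def cache_slot(position):
    return position % attention_window_size

cache = [None] * attention_window_size
for pos, tok in enumerate("abcdefg"):  # 7 tokens through a window of 4
    cache[cache_slot(pos)] = tok       # overwrites the oldest entry

print(cache)  # only the last 4 tokens ('d'..'g') survive, in slot order
```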
StreamingLLM
- Keeps first S “sink tokens” always in cache
- Applies sliding window to remaining tokens
- Uses positions within cache (not original text) for RoPE
- Enables efficient long-text generation
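The token-selection rule above (keep S sink tokens plus a sliding window of recent tokens) can be sketched as follows; `kept_positions` is a hypothetical helper.

```python
# Illustrative sketch of StreamingLLM cache retention: the first S "sink"
# tokens are always kept, plus a sliding window over the newest tokens.
def kept_positions(seq_len, num_sinks, window):
    sinks = list(range(min(num_sinks, seq_len)))
    recent = list(range(max(num_sinks, seq_len - window), seq_len))
    return sinks + recent

print(kept_positions(seq_len=10, num_sinks=2, window=4))  # [0, 1, 6, 7, 8, 9]
```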
Cross-Request Reuse Example
To maximize KV cache reuse across requests, keep shared prefixes (such as system prompts) byte-identical across requests, enable block reuse, and assign those prefixes high retention priority.
Best Practices
Memory Configuration
- Start with free_gpu_memory_fraction=0.9 (default)
- Use FP8 KV cache on Hopper+ GPUs for 2x memory savings
- Enable host offloading for multi-turn conversations
- Monitor cache hit rates and adjust retention policies
Retention Policies
- Assign high priority (80-90) to system prompts and common prefixes
- Use medium priority (50-60) for user-specific context
- Set low priority (20-30) for temporary or one-time prompts
- Use duration_ms=None for prompts that should never expire
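The tiers above can be expressed as a retention policy. A hedged sketch, assuming the LLM API's `KvCacheRetentionConfig` and its nested `TokenRangeRetentionConfig`; field names and the exact duration type may differ across versions, so check the retention config reference.

```python
# Hedged sketch of a tiered retention policy; all names here are
# assumptions based on the TensorRT-LLM LLM API.
from tensorrt_llm.llmapi import KvCacheRetentionConfig

TokenRange = KvCacheRetentionConfig.TokenRangeRetentionConfig

retention = KvCacheRetentionConfig(
    token_range_retention_configs=[
        # system prompt: high priority, never expires
        TokenRange(token_start=0, token_end=128, priority=90, duration_ms=None),
        # remaining context: low priority, reverts after 60 s
        TokenRange(token_start=128, token_end=None, priority=30, duration_ms=60_000),
    ],
)
```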
Security Considerations
- Always use cache_salt in multi-tenant environments
- Set different salts per user, session, or security domain
- Use multimodal UUIDs for deterministic cache management
- Monitor for unusual cache hit patterns (potential attacks)
Additional Resources
KV Cache API Reference
Complete API documentation for KvCacheConfig
Retention Config Example
Advanced retention policy examples
Host Offloading Example
Complete host offloading implementation