Memory Management
This guide covers SGLang’s memory management system, including KV cache allocation, radix caching, and memory optimizations.

Overview
SGLang’s memory system manages:
- Model weights (static, loaded once)
- KV cache (dynamic, per request)
- Activation memory (temporary, per batch)
- Workspace buffers (scratch space for kernels)
The implementation lives in python/sglang/srt/mem_cache/.
Memory Layout
GPU Memory Breakdown
Memory Allocation at Startup
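At startup, SGLang reserves a fixed fraction of GPU memory for weights plus KV cache and leaves the rest for activations and workspace. A simplified sketch of that budgeting logic (function and field names here are illustrative, not SGLang's actual internals):

```python
def plan_memory(total_gpu_bytes, weight_bytes, mem_fraction_static=0.9,
                per_token_kv_bytes=128 * 1024):
    """Split the static GPU memory budget between weights and the KV pool.

    mem_fraction_static caps what may be reserved up front; the remainder
    is left for activations and kernel workspace buffers.
    """
    budget = int(total_gpu_bytes * mem_fraction_static)
    kv_pool_bytes = budget - weight_bytes
    if kv_pool_bytes <= 0:
        raise ValueError("weights alone exceed the static memory budget")
    max_tokens = kv_pool_bytes // per_token_kv_bytes
    return {"kv_pool_bytes": kv_pool_bytes, "max_tokens": max_tokens}

# 80 GB GPU, 16 GB of weights: a 56 GB KV pool
plan = plan_memory(80 * 1024**3, 16 * 1024**3)
print(plan["max_tokens"])  # 458752 tokens of KV capacity
```

The pool size fixes the maximum number of cached tokens for the server's lifetime, which is why the memory fraction is a key tuning knob.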
Token-to-KV Pool
Architecture
The TokenToKVPool manages KV cache allocation at token granularity.
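The real pool hands out integer slot indices into preallocated K/V tensors. A minimal free-list sketch of that idea (the class and method names are illustrative, not SGLang's actual API):

```python
class TokenToKVPoolSketch:
    """Token-granularity KV slot allocator (illustrative sketch).

    Each slot index addresses one token's K/V entries in a preallocated
    tensor of shape [max_tokens, num_layers, num_kv_heads, head_dim].
    """

    def __init__(self, max_tokens):
        self.free_slots = list(range(max_tokens))

    def alloc(self, num_tokens):
        """Return num_tokens slot indices, or None if the pool is full."""
        if num_tokens > len(self.free_slots):
            return None  # caller must evict cached prefixes or wait
        out = self.free_slots[:num_tokens]
        del self.free_slots[:num_tokens]
        return out

    def free(self, slots):
        """Return a finished request's slots to the free list."""
        self.free_slots.extend(slots)

pool = TokenToKVPoolSketch(max_tokens=8)
req = pool.alloc(5)   # 5 slot indices for a 5-token prefill
pool.free(req)        # returned when the request completes
```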
Per-Token KV Cache Size
For a model with:
- num_layers = 32
- num_kv_heads = 8 (GQA)
- head_dim = 128
- dtype = fp16 (2 bytes)

each token needs 2 (one K and one V entry) * 32 layers * 8 heads * 128 dims * 2 bytes = 128 KB of KV cache. For 10,000 cached tokens:

10,000 * 128 KB ≈ 1.3 GB
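The arithmetic above can be checked directly:

```python
num_layers, num_kv_heads, head_dim = 32, 8, 128
dtype_bytes = 2  # fp16
kv_factor = 2    # one K and one V entry per token

per_token = kv_factor * num_layers * num_kv_heads * head_dim * dtype_bytes
print(per_token // 1024)              # 128 (KB per token)
print(10_000 * per_token / 1024**3)   # 1.220703125 GiB, i.e. ~1.3 GB
```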
Radix Cache
Radix Tree for Prefix Sharing
Radix cache uses a tree structure to share KV cache across requests with common prefixes.

Example: Prefix Sharing
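A toy version of the idea: a real radix tree compresses runs of tokens into shared edges, but the sketch below uses one node per token for brevity, and is not SGLang's actual RadixCache implementation.

```python
class RadixCacheSketch:
    """Toy prefix tree over token IDs (illustrative only)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that this token sequence's KV cache is stored."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        """Count leading tokens whose KV cache can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            matched += 1
        return matched

cache = RadixCacheSketch()
cache.insert([1, 2, 3, 4])               # request A: shared system prompt + question A
print(cache.match_prefix([1, 2, 3, 9]))  # 3: request B skips prefill for 3 tokens
```

Request B reuses the KV entries for the 3 shared prefix tokens, so only the divergent suffix needs prefill computation.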
Memory Allocation Strategies
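The three strategies described below (lazy allocation, eager eviction, and LRU eviction of cached prefixes) can be sketched together against a hypothetical pool; none of these names match SGLang's actual classes.

```python
from collections import OrderedDict

class StrategySketch:
    """Illustrative allocator combining the three policies (not SGLang's code)."""

    def __init__(self, max_tokens):
        self.free = max_tokens
        self.cached_prefixes = OrderedDict()  # prefix -> token count, LRU order

    def alloc_token(self):
        """Lazy allocation: one slot per generated token, not the full context."""
        if self.free == 0 and not self.evict_lru():
            raise MemoryError("KV pool exhausted")
        self.free -= 1

    def finish_request(self, n_tokens, prefix=None):
        """Eager eviction: non-shared slots are freed the moment a request ends."""
        if prefix is not None:
            self.cached_prefixes[prefix] = n_tokens  # keep prefix KV for reuse
            self.cached_prefixes.move_to_end(prefix)
        else:
            self.free += n_tokens

    def evict_lru(self):
        """When memory is full, drop the least recently used cached prefix."""
        if not self.cached_prefixes:
            return False
        _, n = self.cached_prefixes.popitem(last=False)
        self.free += n
        return True

pool = StrategySketch(max_tokens=4)
for _ in range(3):
    pool.alloc_token()
pool.finish_request(3, prefix=("sys",))  # keep the prefix cached
pool.alloc_token()
pool.alloc_token()                       # full: evicts the cached prefix first
```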
Lazy Allocation
Allocate KV cache incrementally as tokens are generated.

Eager Eviction
Free the cache immediately when a request finishes.

Cache Eviction Policy
When memory is full, evict the least recently used (LRU) cached prefixes.

KV Cache Formats
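The two layouts described below differ mainly in how a token position maps to storage. A sketch of the index arithmetic under simplified assumptions (a flat slot array, and a per-request page table for the paged case):

```python
def contiguous_offset(req_start, pos):
    """Contiguous: a request's tokens occupy one dense slot range."""
    return req_start + pos

def paged_offset(page_table, pos, page_size=16):
    """Paged: fixed-size pages, indirected through a page table,
    so a request's KV cache need not be physically contiguous."""
    physical_page = page_table[pos // page_size]
    return physical_page * page_size + pos % page_size

# token 20 of a request whose logical pages map to scattered physical pages
table = [5, 9, 2]                # logical page -> physical page
print(contiguous_offset(100, 20))  # 120
print(paged_offset(table, 20))     # 148: physical page 9, slot 4
```

Paging trades one level of indirection for far less fragmentation, which is the core idea behind PagedAttention.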
Contiguous Format
All tokens’ KV cache is stored contiguously.

Paged Format
KV cache is split into fixed-size pages (e.g., PagedAttention).

Memory Optimizations
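The optimizations below all shrink the per-token KV footprint. A quick estimate of two of them (fp8 quantization, and MQA's single KV head) against the 32-layer example from earlier:

```python
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-token KV size: K and V entries across all layers and KV heads."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

base = kv_bytes_per_token()              # fp16, 8 KV heads (GQA)
fp8 = kv_bytes_per_token(dtype_bytes=1)  # quantized KV: halves the cache
mqa = kv_bytes_per_token(kv_heads=1)     # MQA: 1/8 with a single KV head
print(base // 1024, fp8 // 1024, mqa // 1024)  # 128 64 16
```

The savings multiply the token capacity of the pool directly: halving per-token bytes doubles how many tokens fit in the same KV budget.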
1. Quantized KV Cache
Store KV cache in lower precision.

2. HiCache (L3 Storage)
Offload cold KV cache to CPU or SSD.

3. Multi-Query Attention (MQA)
Reduce KV cache size by sharing a single set of KV heads across all query heads.

Monitoring and Debugging
Memory Usage Statistics
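A sketch of the utilization counters worth tracking for the KV pool; the function and field names are illustrative, not SGLang's metrics API. On the GPU side, torch.cuda.memory_allocated() gives the allocator's overall view.

```python
def kv_pool_stats(max_tokens, used_tokens, cached_tokens,
                  per_token_bytes=128 * 1024):
    """Summarize pool occupancy: active requests vs. reusable cached prefixes."""
    free_tokens = max_tokens - used_tokens - cached_tokens
    return {
        "used_gb": used_tokens * per_token_bytes / 1024**3,
        "cached_gb": cached_tokens * per_token_bytes / 1024**3,
        "free_tokens": free_tokens,
        "utilization": (used_tokens + cached_tokens) / max_tokens,
    }

stats = kv_pool_stats(max_tokens=400_000, used_tokens=120_000,
                      cached_tokens=80_000)
print(f"{stats['utilization']:.0%}")  # 50%
```

Distinguishing "used" from "cached" matters: high cached occupancy is healthy (it means prefixes are being reused), while high used occupancy signals the batch is near capacity.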
Visualize Memory Usage
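For quick inspection, torch.cuda.memory_summary() prints a detailed allocator table when PyTorch is available; for pool occupancy, even a dependency-free ASCII gauge like the sketch below is useful.

```python
def memory_bar(used, total, width=40):
    """Render pool occupancy as a fixed-width ASCII gauge."""
    filled = round(width * used / total)
    return "[" + "#" * filled + "." * (width - filled) + f"] {used / total:.0%}"

print(memory_bar(30, 40))
# [##############################..........] 75%
```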
Best Practices
1. Set Appropriate Memory Fraction
2. Enable RadixCache
3. Use Chunked Prefill
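The three practices above map to server launch flags. Flag names are as of recent SGLang releases and the model path is a placeholder; check python -m sglang.launch_server --help for your version.

```shell
# Reserve 85% of GPU memory for weights + KV cache (lower this if activations
# hit OOM); chunk long prefills to bound peak activation memory.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096

# Radix cache is enabled by default; disable it only when prefixes never
# repeat:
#   --disable-radix-cache
```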
Resources
Next Steps
- Scheduler - How requests are batched
- Architecture Overview - System design
- Kernel Development - Optimize kernels
