Memory Management
GPU Memory Ratio
Control the fraction of GPU memory allocated to the KV cache:
- Default: 0.9 (90% of available memory)
- Shared GPU: Reduce to 0.7-0.8 if other processes need GPU memory
- Long contexts: Increase to 0.95 for maximum cache capacity
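The ratio is set when the server is launched. A minimal sketch, assuming an SGLang-style CLI; the launcher module name and the `--mem-fraction-static` flag are assumptions and may differ in Mini-SGLang:

```shell
# Leave ~20% of GPU memory free for other processes on a shared GPU.
# Module and flag names are assumptions based on SGLang's CLI.
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.8
```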
Page Size Configuration
Page size determines the granularity of KV cache allocation:
| Use Case | Page Size | Rationale |
|---|---|---|
| Short sequences (<512 tokens) | 16-32 | Reduces internal fragmentation |
| Medium sequences (512-2048 tokens) | 64-128 | Balanced trade-off |
| Long sequences (>2048 tokens) | 256+ | Fewer page allocations |
Backend constraints:
- TensorRT-LLM: Only supports page sizes of 16, 32, or 64
- FlashInfer: Works with any power-of-2 page size
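As a sketch of how the page size might be set for a long-sequence workload (the `--page-size` flag is an assumption based on SGLang's CLI):

```shell
# Larger pages mean fewer allocations for long sequences.
# With the trtllm backend, this must instead be 16, 32, or 64.
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --page-size 256
```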
Number of Pages
Explicitly control the maximum number of KV cache pages. This is useful for:
- Debugging OOM issues
- Profiling memory usage
- Running on constrained hardware
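To pick a sensible page count, it helps to estimate how many pages fit in a memory budget. A back-of-envelope sketch; the model shape numbers below are illustrative (a hypothetical 7B-class model), not taken from any specific config:

```shell
# Estimate how many KV-cache pages fit in a fixed memory budget.
# Per-token bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
LAYERS=32; KV_HEADS=32; HEAD_DIM=128; DTYPE_BYTES=2   # fp16
PAGE_SIZE=64
BUDGET_BYTES=$(( 20 * 1024 * 1024 * 1024 ))           # 20 GiB reserved for KV cache
BYTES_PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES ))
BYTES_PER_PAGE=$(( PAGE_SIZE * BYTES_PER_TOKEN ))
NUM_PAGES=$(( BUDGET_BYTES / BYTES_PER_PAGE ))
echo "$NUM_PAGES pages of $PAGE_SIZE tokens"
```

With these illustrative numbers, each page holds 32 MiB of KV cache and a 20 GiB budget yields 640 pages.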
Chunked Prefill
Chunked prefill splits long prompts into smaller chunks to reduce peak memory usage and prevent OOM errors.
Configuration
| Context Length | Chunk Size | Notes |
|---|---|---|
| <4K tokens | 4096-8192 | Minimal chunking needed |
| 4K-32K tokens | 8192-16384 | Balance memory and speed |
| 32K-128K tokens | 16384-32768 | Prevent OOM on most GPUs |
| >128K tokens | 32768+ | Very long context scenarios |
- Too small (<512): Significant overhead from multiple kernel launches
- Too large (>32K): Risk of OOM, especially with large batch sizes
- Optimal: Set to 2-4x your typical prompt length
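The chunk size directly determines how many prefill passes a long prompt needs. A small sketch of that arithmetic; the `--chunked-prefill-size` flag mentioned in the comment is an assumption based on SGLang's CLI:

```shell
# A prompt is prefilled in ceil(prompt_len / chunk_size) passes,
# e.g. when launching with --chunked-prefill-size 8192 (assumed flag name).
PROMPT_LEN=20000
CHUNK_SIZE=8192
NUM_CHUNKS=$(( (PROMPT_LEN + CHUNK_SIZE - 1) / CHUNK_SIZE ))  # ceiling division
echo "$NUM_CHUNKS prefill passes"
```

Here a 20K-token prompt with an 8192-token chunk size prefills in 3 passes, capping peak activation memory at one chunk's worth.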
CUDA Graph Optimization
CUDA graphs reduce CPU kernel launch overhead during the decode phase by capturing and replaying GPU operations.
Configuration
| Workload | Max Batch Size | Rationale |
|---|---|---|
| Interactive (1-2 users) | 1-4 | Low concurrency |
| Small deployment | 16-64 | Moderate traffic |
| Production serving | 128-256 | High throughput |
| Memory constrained | 0 (disabled) | Save GPU memory |
- Higher values: Better performance at high concurrency, but more GPU memory usage
- Lower values: Less memory overhead, but may miss optimization opportunities
- Auto-tuning: Leave unset to automatically tune based on GPU memory
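A sketch for a high-throughput deployment, assuming an SGLang-style `--cuda-graph-max-bs` flag (the name may differ in Mini-SGLang):

```shell
# Capture CUDA graphs for decode batches up to 128;
# per the table above, 0 would disable capture entirely.
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --cuda-graph-max-bs 128
```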
When to Disable
- Debugging decode kernels
- Running on very limited GPU memory (<8GB)
- Shell/interactive mode (automatically disabled)
Attention Backend Selection
Mini-SGLang supports multiple attention backends optimized for different phases:
Available Backends
- fa: FlashAttention (including FlashAttention-3 on Hopper GPUs)
- fi: FlashInfer
- trtllm: TensorRT-LLM FMHA
Configuration
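A sketch of selecting backends per phase; the per-phase flag names below are assumptions (Mini-SGLang may instead expose a single `--attention-backend` flag):

```shell
# FlashAttention for prefill, FlashInfer for decode (the Hopper default
# per the table below). Flag names are assumed, not confirmed.
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --prefill-attention-backend fa \
  --decode-attention-backend fi
```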
Recommendations by GPU Architecture
| GPU Architecture | Prefill Backend | Decode Backend | Notes |
|---|---|---|---|
| Hopper (H100, H200) | fa (FA3) | fi | Default; optimal performance |
| Ampere (A100, A10) | fa (FA2) | fi | Good balance |
| Ada (RTX 4090) | fa | fi | Consumer GPUs |
| Older (V100, T4) | fa | fa | Limited FlashInfer support |
Backend-Specific Considerations
FlashAttention:
- Excellent prefill performance
- FlashAttention-3 on Hopper provides a significant speedup
- Works with any page size
FlashInfer:
- Optimized for decode with paged attention
- Better performance with batched decode requests
- Requires the page size to be a power of 2
TensorRT-LLM:
- Highly optimized NVIDIA kernels
- Restricts page size to 16, 32, or 64
- May require additional setup
Cache Management Strategy
Choose between Radix Cache and naive cache management.
Radix Cache is best for:
- Requests with shared prefixes (e.g., system prompts)
- Multi-turn conversations
- Batched requests with common context
- Production serving scenarios
The naive cache is preferable for:
- Benchmarking (for fair comparison)
- Debugging cache-related issues
- Workloads with no shared prefixes
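A sketch of switching to the naive cache for a benchmark run, assuming an SGLang-style `--disable-radix-cache` flag:

```shell
# Turn off prefix sharing so every request pays full prefill cost
# (useful for fair benchmarking). Flag name is an assumption.
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disable-radix-cache
```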
Overlap Scheduling
Overlap scheduling hides CPU scheduling overhead by overlapping it with GPU computation.
Configuration
Overlap scheduling is enabled by default; it can be disabled for ablation studies.
When to Disable
- Debugging scheduler behavior
- Profiling CPU overhead
- Running ablation studies
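A sketch of an ablation run with overlap scheduling off, assuming an SGLang-style `--disable-overlap-schedule` flag:

```shell
# Ablation: measure throughput without CPU/GPU overlap.
# Flag name is an assumption based on SGLang's CLI.
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disable-overlap-schedule
```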
Distributed Serving (Tensor Parallelism)
Scale large models across multiple GPUs:
| Model Size | GPUs | TP Size | Notes |
|---|---|---|---|
| <7B params | 1 | 1 | Single GPU sufficient |
| 7-13B params | 1-2 | 1-2 | Optional TP for speed |
| 14-34B params | 2-4 | 2-4 | TP recommended |
| 70B+ params | 4-8 | 4-8 | TP required |
- Use NVLink-connected GPUs for best performance
- TP size must evenly divide the model's attention head count
- PyNCCL is enabled by default (disable with --disable-pynccl if needed)
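A sketch of serving a 70B-class model with tensor parallelism across 4 GPUs; the `--tp-size` flag name is an assumption based on SGLang's CLI:

```shell
# Shard the model across 4 NVLink-connected GPUs.
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4
```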
Advanced Tuning
Maximum Running Requests
Control scheduler concurrency:
- Higher: Better throughput under load, but more memory usage
- Lower: Reduced memory pressure, but may bottleneck under high QPS
Maximum Sequence Length
Override the model's default max sequence length. Useful when:
- Model supports longer context than the config specifies
- Testing with shorter sequences to save memory
Data Type
Choose precision for model weights:
- auto (default): FP16 for FP32/FP16 models, BF16 for BF16 models
- bfloat16: Better numerical stability on Ampere+ GPUs
- float16: Slightly faster on some GPUs
- float32: Highest precision, but 2x memory usage
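A sketch combining the three advanced options above in one launch; all flag names (`--max-running-requests`, `--context-length`, `--dtype`) are assumptions based on SGLang's CLI and may differ in Mini-SGLang:

```shell
# Cap concurrency at 128 requests, limit context to 8192 tokens,
# and force BF16 weights (flag names are assumptions).
python -m mini_sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 128 \
  --context-length 8192 \
  --dtype bfloat16
```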