Overview
Mini-SGLang supports CUDA graph capture and replay to minimize CPU overhead during decoding. This feature is enabled by default and provides significant performance improvements for high-throughput serving.
How It Works
- Capture Phase: During initialization, Mini-SGLang captures CUDA operations for different batch sizes
- Replay Phase: During inference, captured graphs are replayed instead of launching individual CUDA kernels
- CPU Overhead Reduction: Eliminates kernel launch overhead, allowing CPU to prepare the next batch while GPU executes
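The capture/replay cycle above can be sketched with PyTorch's `torch.cuda.CUDAGraph` API. This is a minimal illustration of the technique, not Mini-SGLang's actual implementation; the model, shapes, and warm-up iteration count are placeholders:

```python
import torch

def capture_decode_graph(model: torch.nn.Module, batch_size: int, hidden: int):
    """Capture one decode step into a CUDA graph with static I/O buffers."""
    static_in = torch.zeros(batch_size, hidden, device="cuda")
    # Warm up on a side stream so lazy allocations happen before capture
    # (required by PyTorch before graph capture).
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(stream)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)  # recorded into the graph, not executed
    return graph, static_in, static_out

def replay(graph, static_in, static_out, new_batch):
    # Copy fresh inputs into the captured buffer, then replay the whole
    # recorded kernel sequence with a single launch.
    static_in.copy_(new_batch)
    graph.replay()
    return static_out.clone()
```

Because the graph records fixed memory addresses, each replay must copy new inputs into the same static buffers that were used during capture.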
Configuration
Use the --cuda-graph-max-bs flag to control CUDA graph behavior:
Maximum batch size for CUDA graph capture. Alias: --graph
- When not specified (or null): Auto-tuned based on available GPU memory
- When set to 0: Disables CUDA graph optimization entirely
- When set to a positive integer: Captures graphs for batch sizes up to this value
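The three cases can be summarized in a small helper (illustrative only; the function name and the `auto_tuned` argument are hypothetical, not Mini-SGLang's API):

```python
from typing import Optional

def resolve_cuda_graph_max_bs(flag: Optional[int], auto_tuned: int) -> int:
    """Map a --cuda-graph-max-bs flag value to an effective maximum.

    Returns 0 when CUDA graphs are disabled.
    """
    if flag is None:   # not specified: fall back to the auto-tuned value
        return auto_tuned
    if flag == 0:      # explicitly disabled
        return 0
    if flag < 0:
        raise ValueError("--cuda-graph-max-bs must be >= 0")
    return flag        # capture graphs for batch sizes up to this value

print(resolve_cuda_graph_max_bs(None, 64))  # auto-tuned -> 64
print(resolve_cuda_graph_max_bs(0, 64))     # disabled -> 0
print(resolve_cuda_graph_max_bs(32, 64))    # explicit cap -> 32
```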
Source: minisgl.server.args.py:149
Batch Size Selection
Mini-SGLang captures CUDA graphs for multiple batch sizes to handle varying workloads efficiently. The specific batch sizes are determined by:
- Maximum Batch Size: Controlled by --cuda-graph-max-bs
- Auto-tuning: Based on GPU memory availability when not explicitly set
- Power-of-2 Strategy: Typically captures graphs for powers of 2 (e.g., 1, 2, 4, 8, 16, 32)
Implementation Details
Each attention backend implements CUDA graph support:
FlashAttention Backend
Source: minisgl.attention.fa.FlashAttentionBackend (source: ~/workspace/source/python/minisgl/attention/fa.py:107)
FlashInfer Backend
Source: minisgl.attention.fi.FlashInferBackend (source: ~/workspace/source/python/minisgl/attention/fi.py:222)
TensorRT-LLM Backend
Source: minisgl.attention.trtllm.TensorRTLLMBackend (source: ~/workspace/source/python/minisgl/attention/trtllm.py:131)
Performance Benefits
CUDA graph optimization provides several benefits:
1. Reduced CPU Overhead
Without CUDA graph, each CUDA kernel launch requires CPU involvement:
- Kernel parameter setup
- Driver API calls
- Synchronization overhead
2. Overlap Scheduling
CUDA graphs enable overlap scheduling (from NanoFlow):
- CPU prepares the next batch while the GPU executes the current batch
- Improves overall system throughput
- Reduces end-to-end latency
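The scheduling pattern above can be illustrated with a CPU-only simulation. The sleeps stand in for real work, and the single-worker executor plays the role of the GPU; in Mini-SGLang the actual overlap is between Python batch preparation and asynchronous graph replay on the device:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_batch(step: int) -> str:
    time.sleep(0.01)  # CPU-side work: tokenization, metadata, buffer setup
    return f"batch-{step}"

def gpu_execute(batch: str) -> str:
    time.sleep(0.01)  # stands in for an asynchronous CUDA graph replay
    return batch + ":done"

def run_overlapped(num_steps: int) -> list[str]:
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending = gpu.submit(gpu_execute, prepare_batch(0))
        for step in range(1, num_steps):
            nxt = prepare_batch(step)      # CPU works while "GPU" runs
            results.append(pending.result())
            pending = gpu.submit(gpu_execute, nxt)
        results.append(pending.result())
    return results

print(run_overlapped(3))  # ['batch-0:done', 'batch-1:done', 'batch-2:done']
```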
3. Deterministic Performance
Captured graphs provide consistent execution times, making performance more predictable.
Usage Examples
High-Throughput Serving
For maximum throughput, enable CUDA graph with a large batch size:
Memory-Constrained Serving
Reduce CUDA graph batch size to save memory:
Debugging or Development
Disable CUDA graph for easier debugging:
Shell Mode
Shell mode automatically sets --cuda-graph-max-bs 1 for interactive use.
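Putting the scenarios above together, launch commands might look like the following. The `python -m minisgl` entrypoint is an assumption for illustration; only the `--cuda-graph-max-bs` flag and its values are documented on this page:

```shell
# High-throughput serving: capture graphs for large batch sizes
python -m minisgl --cuda-graph-max-bs 256

# Memory-constrained serving: capture fewer, smaller graphs
python -m minisgl --cuda-graph-max-bs 16

# Debugging or development: disable CUDA graph entirely
python -m minisgl --cuda-graph-max-bs 0
```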
Memory Considerations
CUDA graphs require additional GPU memory:
- Graph Storage: Captured graphs consume memory
- Multiple Batch Sizes: Each batch size requires a separate graph
- Buffer Allocation: Pre-allocated buffers for maximum batch size
Trade-offs:
- Higher --cuda-graph-max-bs: Better performance, more memory usage
- Lower --cuda-graph-max-bs: Less memory usage, potentially lower throughput
- --cuda-graph-max-bs 0: Minimum memory, no CUDA graph benefits
Auto-Tuning
When --cuda-graph-max-bs is not specified, Mini-SGLang auto-tunes based on:
- Available GPU memory after model loading
- KV cache allocation (controlled by --memory-ratio)
- Estimated graph memory overhead
- Model size and architecture
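An illustrative heuristic for this style of auto-tuning. The function name, candidate list, and per-graph memory cost are all invented for the example and are not Mini-SGLang's actual algorithm:

```python
def auto_tune_max_bs(free_bytes: int, per_graph_bytes: int = 64 << 20,
                     candidates: tuple[int, ...] = (256, 128, 64, 32, 16, 8)) -> int:
    """Pick the largest candidate whose estimated graph memory fits."""
    for bs in candidates:
        # Rough model: one graph per power-of-2 size up to bs,
        # i.e. log2(bs) + 1 graphs in total.
        num_graphs = bs.bit_length()
        if num_graphs * per_graph_bytes <= free_bytes:
            return bs
    return 0  # not enough free memory: disable CUDA graph

print(auto_tune_max_bs(2 << 30))    # ample headroom -> 256
print(auto_tune_max_bs(400 << 20))  # tight -> 32
print(auto_tune_max_bs(10 << 20))   # too little -> 0 (disabled)
```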
Compatibility
Attention Backends
All attention backends support CUDA graph:
- ✅ FlashAttention (fa)
- ✅ FlashInfer (fi)
- ✅ TensorRT-LLM (trtllm)
GPU Architectures
CUDA graph is supported on:
- NVIDIA Ampere (A100, A10, etc.)
- NVIDIA Hopper (H100, H200, etc.)
- NVIDIA Ada Lovelace (RTX 4090, L40S, etc.)
- Older architectures with CUDA 10.0+
Limitations
- Only applies to decode phase (single-token generation)
- Prefill phase (multi-token processing) does not use CUDA graph
- Requires deterministic batch sizes during capture
Related Features
CUDA graph works in conjunction with:
- Overlap Scheduling: Maximizes GPU utilization by overlapping CPU and GPU work
- Chunked Prefill: Splits long prompts to enable efficient batching (see --max-prefill-length)
- Radix Cache: Reuses KV cache across requests (see Cache Management)
Troubleshooting
Out of Memory Errors
If you encounter OOM errors, reduce CUDA graph batch size:
Performance Degradation
If disabling CUDA graph improves performance, you may have:
- Very small batch sizes (< 4)
- Highly variable request patterns
- Memory pressure causing swapping
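For the OOM case above, hedged examples (the `python -m minisgl` entrypoint and the specific values are assumptions; the flags themselves appear earlier on this page):

```shell
# Step down --cuda-graph-max-bs until the OOM disappears,
# or set it to 0 to rule CUDA graphs out entirely
python -m minisgl --cuda-graph-max-bs 16

# Alternatively, leave more headroom by shrinking the KV cache allocation
python -m minisgl --memory-ratio 0.8
```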