Achieving High Throughput for Offline Batch Inference
Achieving a large batch size is the most important factor for attaining high throughput in offline batch inference. When the server is running at full load in a steady state, look for log entries like the ones below.

Control Queue Size
`#queue-req` is the number of requests in the queue. A healthy range is 100-2000.
- `#queue-req: 0` frequently: Client code is submitting requests too slowly. Increase the request submission rate.
- `#queue-req > 2000` frequently: Too many queued requests increase scheduling overhead. Reduce the request submission rate.
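As a rough illustration, the queue-size heuristics above can be automated by scanning the server logs. The regex and thresholds below assume the `#queue-req: N` log format described here; adapt them to your server version.

```python
import re

# Assumed log format: "... #queue-req: 123 ..." (matches the entries above).
QUEUE_RE = re.compile(r"#queue-req:\s*(\d+)")

def classify_queue(log_line: str) -> str:
    """Classify queue depth: 100-2000 is the healthy band."""
    match = QUEUE_RE.search(log_line)
    if match is None:
        return "unknown"
    depth = int(match.group(1))
    if depth == 0:
        return "starved"      # client submits too slowly
    if depth > 2000:
        return "overloaded"   # scheduling overhead grows
    if depth < 100:
        return "low"          # below the healthy band, but not empty
    return "healthy"
```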
Maximize Token Usage
`token usage` is the KV cache memory utilization of the server. Target: > 0.9 for good utilization.
- `token usage < 0.9` and `#queue-req > 0` frequently: The server is too conservative about taking new requests.
  - Solution: Decrease `--schedule-conservativeness` to a value like `0.3`.
  - Common cause: Users send many requests with a large `max_new_tokens`, but the requests stop early due to EOS or stop strings.
- `token usage` very high with frequent retraction warnings:
  - Solution: Increase `--schedule-conservativeness` to a value like `1.3`.
  - Note: Occasional retractions (~1 per minute) are acceptable.
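The two adjustment rules above can be condensed into a small client-side helper. This is an illustrative heuristic, not an SGLang API; the `0.3`/`1.3` values are the examples given above.

```python
# Illustrative heuristic (not an SGLang API): suggest a new value for
# --schedule-conservativeness from the log signals discussed above.
def suggest_conservativeness(token_usage: float, queue_req: int,
                             retractions_per_min: float,
                             current: float = 1.0) -> float:
    if token_usage < 0.9 and queue_req > 0:
        return min(current, 0.3)   # too conservative: admit requests faster
    if retractions_per_min > 1.0:
        return max(current, 1.3)   # frequent retractions: admit more slowly
    return current                 # ~1 retraction/min is acceptable
```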
Tune Memory Allocation
SGLang allocates GPU memory for model weights, the KV cache pool, activations, and CUDA graph buffers. `--mem-fraction-static` determines the memory allocation for the first two components (weights and KV cache). Set `--mem-fraction-static` as high as possible while reserving enough memory for activations and CUDA graph buffers.
Optimization Process:
1. Check the available GPU memory in the logs before the server is ready.
2. Evaluate `available_gpu_mem`:
   - 5-8 GB: Good setting.
   - 10-20 GB: Too high; increase `--mem-fraction-static` to allocate more memory to the KV cache.
   - < 5 GB: Too low; risk of OOM errors. Decrease `--mem-fraction-static`.
3. Alternative approach: Increase `--mem-fraction-static` in increments of 0.01 until you encounter OOM errors for your workloads.
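The decision table in step 2 can be written out directly. The thresholds are the ones listed above; `avail_gb` is assumed to come from reading the server's startup logs.

```python
# Decision table for available_gpu_mem (in GB), per the thresholds above.
def evaluate_avail_mem(avail_gb: float) -> str:
    if avail_gb < 5:
        return "too low: decrease --mem-fraction-static (OOM risk)"
    if avail_gb <= 8:
        return "good setting"
    if avail_gb >= 10:
        return "too high: increase --mem-fraction-static"
    return "slightly high: minor tuning only"
```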
Avoid Out-of-Memory Errors
If you encounter OOM errors:OOM During Prefill
OOM During Prefill
- Reduce
--chunked-prefill-sizeto4096or2048 - Tradeoff: Saves memory but slows down prefill for long prompts
OOM During Decoding
OOM During Decoding
- Lower
--max-running-requests - Tradeoff: Limits maximum concurrency
General OOM
General OOM
- Reduce
--mem-fraction-staticto0.8or0.7 - Tradeoff: Decreases KV cache capacity, limits peak throughput
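The three OOM cases above can be summarized in a small lookup table (purely illustrative, not part of SGLang):

```python
# Phase in which OOM occurred -> (flag to adjust, suggested action),
# condensing the guidance above. Unknown phases fall back to the general fix.
OOM_FIXES = {
    "prefill": ("--chunked-prefill-size", "reduce to 4096 or 2048"),
    "decode":  ("--max-running-requests", "lower it"),
    "general": ("--mem-fraction-static",  "reduce to 0.8 or 0.7"),
}

def oom_fix(phase: str) -> str:
    flag, action = OOM_FIXES.get(phase, OOM_FIXES["general"])
    return f"{flag}: {action}"
```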
Tune CUDA Graph Coverage
`--cuda-graph-max-bs` is the maximum batch size for CUDA graph capture. The default varies by model (typically 160-256).
- Increase `--cuda-graph-max-bs` to a larger value (e.g., 512 or 768) to keep CUDA graphs active for larger batches.
- Important: CUDA graphs consume more memory, so reduce `--mem-fraction-static` at the same time.
Optimize Parallelism Strategy
Data parallelism size. Better for throughput than tensor parallelism when GPU memory allows.
Tensor parallelism size. Required for large models that don’t fit on a single GPU.
- Data parallelism is better for throughput: When there is enough GPU memory, always favor data parallelism
- Use SGLang Model Gateway: For better data parallelism management rather than using
--dp-sizeparameter - Tensor parallelism: Use only when model doesn’t fit on a single GPU
Additional Optimizations

Torch Compile
Accelerates small models at small batch sizes.

FP8 Quantization
Reduces memory footprint and improves throughput.

Expert Parallelism
For MoE models, distributes experts across GPUs. See the Expert Parallelism blog.

DP Attention
For DeepSeek models with data parallelism.
Use Longest Prefix Match Scheduling
`--schedule-policy` sets the scheduling policy for requests. Options: `fcfs`, `lpm`.
- `lpm` (Longest Prefix Match) reorders requests to encourage more cache hits.
- `lpm` introduces more scheduling overhead than `fcfs`.
- Best for workloads with high prefix reuse (e.g., many similar prompts).
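To see why reordering by shared prefix helps, here is a toy sketch of longest-prefix-match ordering over plain strings. SGLang's scheduler matches requests against its radix cache rather than only the previous prompt, so this is an intuition aid, not the actual algorithm.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lpm_order(prompts: list[str]) -> list[str]:
    """Greedy toy version of lpm: always run the pending prompt that
    shares the longest prefix with the one just processed."""
    pending = list(prompts)
    ordered = [pending.pop(0)]  # fall back to arrival order for the head
    while pending:
        best = max(pending, key=lambda p: shared_prefix_len(p, ordered[-1]))
        pending.remove(best)
        ordered.append(best)
    return ordered
```

Prompts sharing a long system-prompt prefix end up adjacent, so the KV cache entries for that prefix stay warm between batches.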
Optimizing for Different Workloads

Online Serving (Low Latency)
Priorities: Low latency, consistent response times.

Use Fast Attention Backend
Choose the fastest backend for your hardware:
- Hopper: `--attention-backend fa3`
- Blackwell: `--attention-backend trtllm_mha` or `--attention-backend trtllm_mla`
- Ampere/Ada: `--attention-backend flashinfer`

Offline Batch Processing (High Throughput)
Priorities: Maximum throughput, high GPU utilization.

Long-Context Workloads
Priorities: Support long sequences, maximize prefix reuse.

Multi-turn Conversations
Priorities: Reuse conversational context, low latency for follow-ups.

Monitoring and Metrics
Key Metrics to Monitor
- `gen throughput`: Generation throughput in tokens per second. The primary performance metric.
- `token usage`: KV cache memory utilization. Target: > 0.9 for good utilization.
- `#running-req`: Number of requests currently being processed. Should be close to `--max-running-requests` under load.
- `#queue-req`: Number of requests in the queue. Healthy range: 100-2000.
- `cuda graph`: Whether CUDA graph is active for the current batch. Should be `True` for small batches.

Enable Metrics Collection
- Prometheus endpoint: `http://localhost:30000/metrics`
- Cache report: Periodic logs showing cache hit rates
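A quick way to consume the Prometheus endpoint is to parse its plain-text exposition format. The metric names in the sample below are placeholders; check your server's actual `/metrics` output for the real names.

```python
def parse_prom(text: str) -> dict[str, float]:
    """Parse un-labeled gauges from the Prometheus text exposition format."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines whose value is not numeric
    return metrics

# Placeholder sample; real metric names depend on your SGLang version.
sample = """\
# HELP gen_throughput Generation throughput (token/s)
gen_throughput 4231.5
token_usage 0.93
"""
```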
Troubleshooting Performance Issues

Low Throughput
Symptoms: `gen throughput` significantly lower than expected.

Check Token Usage
If `token usage < 0.9`:
- Decrease `--schedule-conservativeness`
- Increase `--mem-fraction-static`

Check Queue Size
If `#queue-req: 0` frequently:
- Increase the request submission rate
- The client is the bottleneck, not the server

Check CUDA Graph
If `cuda graph: False` for small batches:
- Increase `--cuda-graph-max-bs`
- Verify that CUDA graph is not disabled

Check Attention Backend
- Verify that you are using the optimal backend for your hardware
- Try different backends and benchmark
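The four checks above can be combined into one client-side triage pass (an illustrative helper, not an SGLang tool):

```python
# Map the low-throughput checks above onto concrete suggestions.
def triage_low_throughput(token_usage: float, queue_req: int,
                          cuda_graph_active: bool) -> list[str]:
    advice = []
    if token_usage < 0.9:
        advice.append("decrease --schedule-conservativeness "
                      "or increase --mem-fraction-static")
    if queue_req == 0:
        advice.append("increase client request submission rate")
    if not cuda_graph_active:
        advice.append("increase --cuda-graph-max-bs")
    if not advice:
        advice.append("benchmark alternative attention backends")
    return advice
```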
High Latency
Symptoms: Requests take longer than expected to completeCheck Batch Size
Check Batch Size
If batch size is too large:
- Reduce
--max-running-requests - Trade throughput for lower latency
Check Queue Size
Check Queue Size
If
#queue-req is very high:- Reduce request submission rate
- Requests are waiting too long in queue
Check Prefill Time
Check Prefill Time
If long prompts dominate:
- Increase
--chunked-prefill-size - Enable Piecewise CUDA Graph
Memory Issues
Symptoms: OOM errors, frequent retractionsReduce Memory Pressure
Reduce Memory Pressure
- Decrease
--mem-fraction-static - Reduce
--cuda-graph-max-bs - Reduce
--chunked-prefill-size
Increase Memory Efficiency
Increase Memory Efficiency
- Enable quantized KV cache:
--kv-cache-dtype fp8_e4m3 - Use FP8 weight quantization:
--quantization fp8
Adjust Scheduling
Adjust Scheduling
If frequent retractions:
- Increase
--schedule-conservativeness
Best Practices Summary
Start Conservative
Begin with default settings and tune incrementally based on metrics.
Monitor Metrics
Enable metrics and cache reporting to make data-driven tuning decisions.
Workload-Specific Tuning
Optimize for your specific workload characteristics (online vs. offline, long vs. short context).
Benchmark Regularly
Test performance after each configuration change to validate improvements.
Balance Tradeoffs
Understand the tradeoffs between latency, throughput, and memory usage.
Use Latest Features
Leverage HiCache, Piecewise CUDA Graph (PCG), and optimized attention backends for best performance.
