Overview
SGLang includes bench_serving.py, a comprehensive benchmarking tool for measuring serving performance under various load patterns. It supports multiple backends, datasets, and request distributions.
Quick Start
Basic Benchmark
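A minimal invocation might look like the following sketch. The script path and a server already listening on the default port are assumptions; the flags come from the options documented below.

```shell
# Basic benchmark: 100 random prompts against a local SGLang server.
# Assumes bench_serving.py is in the current directory and a server
# is running at the default http://localhost:30000.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --num-prompts 100
```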
Run a simple benchmark with random prompts.

Full Benchmark Example
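A fuller sketch combining the load, dataset, and output options described below. All values are illustrative, and the script path is an assumption.

```shell
# Full example: fixed request rate, bounded concurrency, warmup,
# and detailed results written to a JSON file.
python3 bench_serving.py \
    --backend sglang \
    --host http://localhost:30000 \
    --dataset-name random \
    --random-input 1024 \
    --random-output 128 \
    --num-prompts 1000 \
    --request-rate 10 \
    --max-concurrency 64 \
    --warmup-requests 5 \
    --output-file results.json
```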
Command-Line Options
Backend Configuration
--backend (required)
Specify the serving backend:
- sglang: Native SGLang /generate endpoint
- sglang-oai: OpenAI-compatible completions API
- sglang-oai-chat: OpenAI-compatible chat completions API
- vllm, vllm-chat, lmdeploy, lmdeploy-chat: Other backends
--host
Server endpoint (default: http://localhost:30000)
Dataset Options
--dataset-name
Dataset to use for benchmarking:
- random: Randomly generated prompts
- sharegpt: ShareGPT conversation dataset
- arxiv: ArXiv paper abstracts
- mooncake: Time-based trace replay
--dataset-path
Path to custom dataset file (JSON format)
Random Dataset Options
--random-input
Number of input tokens for random prompts (default: 1024)
--random-output
Number of output tokens for random prompts (default: 128)
--random-range-ratio
Randomness range ratio (default: 1.0)
- Sets token length variance: ± ratio * length / 2
- Example: --random-range-ratio 0.5 with --random-input 1024 gives range [768, 1280]
Load Configuration
--num-prompts
Number of requests to send (required)
--request-rate
Request rate in requests/second (default: inf - send all at once)
- Use inf for throughput benchmarks
- Use finite values (e.g., 10) for latency benchmarks
--max-concurrency
Maximum number of concurrent requests (default: unlimited)
--warmup-requests
Number of warmup requests before benchmark (default: 1)
Output Options
--output-file
Path to save detailed results (JSON format)
--disable-tqdm
Disable progress bar
--plot-throughput
Plot throughput over time (requires termplotlib and gnuplot)
Advanced Options
--disable-stream
Disable streaming responses (non-streaming mode)
--disable-ignore-eos
Respect EOS tokens (default: ignore EOS for consistent output lengths)
--return-logprob
Return log probabilities (SGLang native API only)
--return-routed-experts
Return routed expert information for MoE models
--extra-request-body
JSON string with additional request parameters
--header
Custom HTTP headers (format: key:value)
Multi-Turn Chat
--multi-turn
Enable multi-turn conversation mode (chat backends only)
--num-turns
Number of turns per conversation (default: 1)
LoRA Benchmarking
--lora-name
LoRA adapter names (space-separated list)
--lora-request-distribution
Distribution of LoRA requests:
- uniform: Randomly select from all adapters
- distinct: Round-robin through adapters
- skewed: Zipf distribution (use with --lora-zipf-alpha)
--lora-zipf-alpha
Alpha parameter for Zipf distribution (default: 1.0)
Benchmark Metrics
Output Metrics
After completion, bench_serving reports:
Throughput:
- request_throughput: Requests per second
- input_throughput: Input tokens per second
- output_throughput: Output tokens per second
- total_throughput: Total tokens per second
- max_output_tokens_per_s: Peak token generation rate
Latency:
- mean_ttft_ms: Mean time to first token
- median_ttft_ms: Median TTFT
- p99_ttft_ms: 99th percentile TTFT
- mean_tpot_ms: Mean time per output token
- mean_itl_ms: Mean inter-token latency
- p99_itl_ms: 99th percentile ITL
- mean_e2e_latency_ms: Mean end-to-end latency
- p99_e2e_latency_ms: 99th percentile E2E latency
Request counts:
- completed: Number of successful requests
- concurrency: Average concurrent requests
- max_concurrent_requests: Peak concurrent requests
Example Output
Common Benchmark Scenarios
1. Maximum Throughput
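A sketch for this scenario, relying on --request-rate defaulting to inf so that every request is dispatched at once (prompt count is illustrative):

```shell
# Maximum throughput: no rate limit, large request count.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --num-prompts 2000
```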
Send all requests simultaneously to measure peak throughput.

2. Sustained Load Test
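A sketch invocation for this scenario; the arrival rate and request count are illustrative values:

```shell
# Sustained load: a fixed arrival rate of 10 req/s over 1200 requests.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --num-prompts 1200 \
    --request-rate 10
```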
Test steady-state performance with a fixed request rate.

3. Latency Benchmark
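A sketch invocation for this scenario: capping concurrency at one request in flight isolates per-request latency from queueing effects.

```shell
# Latency benchmark: one request at a time.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --num-prompts 50 \
    --request-rate 1 \
    --max-concurrency 1
```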
Measure single-request latency with minimal concurrency.

4. Long Context Benchmark
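A sketch invocation for this scenario; the input length of 16K tokens is illustrative:

```shell
# Long context: long inputs with short outputs stress the prefill phase.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --random-input 16384 \
    --random-output 128 \
    --num-prompts 100
```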
Test performance with long input contexts.

5. Prefix Caching Effectiveness
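A sketch invocation for this scenario, assuming (per the option description above) that a low --random-range-ratio reduces prompt variance and therefore raises cache hit rates:

```shell
# Prefix caching: low range ratio keeps prompts similar across requests.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --random-input 4096 \
    --random-output 128 \
    --random-range-ratio 0.1 \
    --num-prompts 500
```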
Benchmark with shared prefixes to measure cache hit rates. A low --random-range-ratio creates similar prompts, increasing cache hits.
6. Multi-Turn Conversation
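A sketch invocation for this scenario; multi-turn mode requires a chat backend, and the turn count is illustrative:

```shell
# Multi-turn chat: 4 turns per conversation via the chat completions API.
python3 bench_serving.py \
    --backend sglang-oai-chat \
    --dataset-name sharegpt \
    --num-prompts 200 \
    --multi-turn \
    --num-turns 4
```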
Benchmark chat completions with multiple turns.

7. LoRA Adapter Performance
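A sketch invocation for this scenario; the adapter names are placeholders for adapters already loaded on the server:

```shell
# LoRA: alternate uniformly across two adapters (names are placeholders).
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --num-prompts 500 \
    --lora-name adapter-a adapter-b \
    --lora-request-distribution uniform
```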
Test LoRA adapter switching overhead.

8. Trace Replay
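A sketch invocation for this scenario; the trace file path is a placeholder:

```shell
# Trace replay: the mooncake dataset replays requests on recorded timestamps.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name mooncake \
    --dataset-path /path/to/mooncake_trace.jsonl \
    --num-prompts 1000
```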
Replay production traffic patterns with the Mooncake dataset.

Profiling
Enable PyTorch profiling during benchmarks:
- --profile: Enable profiling
- --profile-output-dir: Directory for profile traces
- --profile-num-steps: Number of steps to profile
- --profile-by-stage: Profile by processing stage
- --profile-stages: Specific stages to profile
- --profile-activities: Activities to track (e.g., cpu, cuda)
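The flags above can be combined as in this sketch (step count and output directory are illustrative):

```shell
# Profile 10 steps, tracking CPU and CUDA activity; traces go to ./profiles.
python3 bench_serving.py \
    --backend sglang \
    --dataset-name random \
    --num-prompts 100 \
    --profile \
    --profile-output-dir ./profiles \
    --profile-num-steps 10 \
    --profile-activities cpu cuda
```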
Disaggregated Mode
For prefill-decode disaggregation, profile both the prefill and decode workers.

Interpreting Results
Good Performance Indicators
- TTFT: <100ms for short contexts, <500ms for long contexts
- TPOT: 10-20ms for typical models
- ITL: Low variance (std < mean)
- Throughput: Scales with batch size and concurrency
- Cache hit rate: >50% for production workloads with repeated patterns
Performance Issues
High TTFT:
- Large batch size (queue depth)
- Long input contexts
- Memory allocation delays
Low throughput:
- Low batch size (GPU underutilization)
- Model size vs hardware mismatch
- Memory bandwidth bottleneck
High ITL variance:
- Scheduler preemption
- Mixed request sizes
- Cache eviction
Low GPU utilization:
- Too few concurrent requests
- Small batch sizes
- CPU bottleneck (tokenization)
Comparing Backends
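A sketch for running the same workload against several backends in turn; the backend list is illustrative, and each run writes its own results file:

```shell
# Same workload, different backends; results land in per-backend files.
for backend in sglang sglang-oai vllm; do
    python3 bench_serving.py \
        --backend "$backend" \
        --dataset-name random \
        --num-prompts 500 \
        --output-file "results_${backend}.json"
done
```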
Benchmark multiple backends with the same workload.

Custom Datasets
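A minimal sketch of writing a dataset in JSON-lines format (one JSON object per line). The field names here are an assumption; check bench_serving.py's dataset loader for the exact schema it expects.

```shell
# Write a two-record dataset in JSON-lines format.
# Field names ("prompt", "output_len") are assumed, not verified.
cat > custom_dataset.jsonl <<'EOF'
{"prompt": "Summarize the history of the transistor.", "output_len": 256}
{"prompt": "Explain KV-cache reuse in one paragraph.", "output_len": 128}
EOF
```

The file would then be passed to the benchmark via --dataset-path custom_dataset.jsonl.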
Create a custom dataset file (JSON lines format).

Continuous Benchmarking
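One way to sketch periodic runs is a simple loop with timestamped result files; the interval and request count are illustrative, and a scheduler such as cron would work equally well:

```shell
# Hourly benchmark loop; each run writes a timestamped results file.
while true; do
    python3 bench_serving.py \
        --backend sglang \
        --dataset-name random \
        --num-prompts 200 \
        --output-file "bench_$(date +%Y%m%d_%H%M).json" \
        --disable-tqdm
    sleep 3600
done
```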
For production monitoring, run periodic benchmarks.

Best Practices
- Warmup: Always use --warmup-requests to exclude cold-start effects
- Multiple runs: Run benchmarks 3-5 times and average results
- Representative workloads: Use datasets matching your production traffic
- Metrics collection: Enable --enable-metrics on the server during benchmarks
- System isolation: Run benchmarks on dedicated hardware when possible
- Network latency: Co-locate benchmark client and server to isolate serving performance
- Monitor resources: Watch GPU/CPU/memory utilization during benchmarks
Troubleshooting
Connection Errors
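If requests fail immediately, first confirm the server is reachable. The /health endpoint here is an assumption; any endpoint that returns HTTP 200 serves the purpose.

```shell
# Quick reachability check before benchmarking; prints the HTTP status code.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:30000/health
```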
Authentication
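A key can be supplied through the documented key:value header format. The Bearer scheme and the OPENAI_API_KEY variable name are assumptions for OpenAI-compatible servers.

```shell
# Pass an API key as a custom header (key:value format).
python3 bench_serving.py \
    --backend sglang-oai \
    --dataset-name random \
    --num-prompts 100 \
    --header "Authorization:Bearer $OPENAI_API_KEY"
```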
Set an API key if the server requires one.

High Error Rate
- Reduce --request-rate or --max-concurrency
- Increase server --max-running-requests
- Check server logs for OOM errors
Next Steps
- Set up monitoring to track metrics during benchmarks
- Review available metrics to analyze results
- Enable tracing for detailed request analysis
