The `vllm bench` command provides benchmarking utilities for measuring vLLM performance across different scenarios.
Available benchmarks
Subcommands
- `throughput` - Measure throughput (tokens/second)
- `latency` - Measure end-to-end latency
- `serve` - Benchmark a running server
- `startup` - Measure engine startup time
Throughput benchmark
Measures the maximum throughput (tokens per second) that vLLM can achieve.

Examples
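A typical invocation might look like the following sketch; the model name is a placeholder, and the lengths should be adjusted to your hardware and workload:

```shell
# Benchmark offline throughput: 500 prompts of 128 input tokens,
# generating 256 output tokens each.
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 128 \
  --output-len 256 \
  --num-prompts 500
```

Results are reported as aggregate tokens/second and requests/second once all prompts complete.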
Common options
- `--model` - Model name or path to benchmark.
- `--input-len` - Input prompt length in tokens.
- `--output-len` - Output sequence length in tokens.
- `--num-prompts` - Number of prompts to benchmark.
- `--tensor-parallel-size` - Number of GPUs for tensor parallelism.
- `--dtype` - Data type: `auto`, `float16`, `bfloat16`, or `float32`.

Latency benchmark
Measures end-to-end latency for text generation.

Examples
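A minimal sketch of a latency run (the model name is a placeholder; batch size 1 isolates single-request latency):

```shell
# Measure end-to-end latency for a single request:
# 32 input tokens, 128 generated tokens.
vllm bench latency \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 32 \
  --output-len 128 \
  --batch-size 1
```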
Key metrics
The latency benchmark reports:

- First token latency: Time to generate the first token (TTFT)
- Inter-token latency: Time between subsequent tokens
- End-to-end latency: Total generation time
- P50, P90, P95, P99: Latency percentiles
Server benchmark
Benchmarks a running vLLM server via HTTP requests.

Examples
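A sketch of a server benchmark run, assuming a vLLM server is already listening on `localhost:8000` and serving the placeholder model shown:

```shell
# Send 200 requests at 4 requests/second against a running server.
vllm bench serve \
  --base-url http://localhost:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 4
```

Omitting `--request-rate` sends requests as fast as possible, which measures peak server throughput rather than behavior under a fixed load.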
Options

- `--base-url` - Base URL of the vLLM server.
- `--num-prompts` - Number of requests to send.
- `--request-rate` - Request rate (requests per second). If not specified, requests are sent as fast as possible.
- `--random-input-len` - Input prompt length.
- `--random-output-len` - Output sequence length.
Startup benchmark
Measures engine initialization and model loading time.

Example
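A minimal sketch, using the `startup` subcommand listed above with a placeholder model name:

```shell
# Time engine initialization and model loading.
vllm bench startup --model meta-llama/Llama-3.1-8B-Instruct
```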
The startup benchmark reports:

- Model loading time
- Engine initialization time
- Total startup time
Understanding results
Throughput metrics
- Throughput: Tokens generated per second across all requests
- Request throughput: Requests completed per second
Throughput generally improves with:

- More GPUs (via tensor/pipeline parallelism)
- Larger batch sizes
- Shorter sequences
- More efficient quantization
Latency metrics
- TTFT (Time to First Token): Critical for interactive use
- TPOT (Time Per Output Token): Affects streaming responsiveness
- Total latency: End-to-end generation time
Latency is affected by:

- Model size
- Input/output length
- Batch size (higher batch = higher latency)
- Number of GPUs
Example workflow
1. Baseline performance
2. Scale up with tensor parallelism
3. Test with different sequence lengths
4. Measure latency characteristics
5. Benchmark production server
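The workflow above can be sketched as a sequence of commands; `$MODEL` stands in for your model name or path, and flag values are illustrative:

```shell
MODEL=meta-llama/Llama-3.1-8B-Instruct  # placeholder

# 1. Baseline throughput on a single GPU
vllm bench throughput --model $MODEL --input-len 128 --output-len 128

# 2. Scale up with tensor parallelism
vllm bench throughput --model $MODEL --tensor-parallel-size 4

# 3. Test with different (longer) sequence lengths
vllm bench throughput --model $MODEL --input-len 1024 --output-len 1024

# 4. Measure latency characteristics
vllm bench latency --model $MODEL --batch-size 1 --output-len 128

# 5. Benchmark the production server
vllm bench serve --base-url http://localhost:8000 --model $MODEL
```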
Optimization tips
Based on benchmark results:

High throughput needed:

- Increase `--max-num-seqs`
- Enable `--enable-chunked-prefill`
- Use `--enable-prefix-caching` for repeated prefixes
- Increase `--gpu-memory-utilization`

Low latency needed:

- Reduce batch size (lower `--max-num-seqs`)
- Use tensor parallelism for large models
- Consider quantization (AWQ, GPTQ)
- Disable unnecessary features

Memory constrained:

- Reduce `--gpu-memory-utilization`
- Enable quantization
- Reduce `--max-model-len`
- Use CPU offloading if needed
Related
- `vllm serve` - Start API server
- `EngineArgs` - Configuration options