The `vllm bench` command provides benchmarking utilities for measuring vLLM performance across different scenarios.
Available benchmarks
Subcommands
- `throughput` - Measure throughput (tokens/second)
- `latency` - Measure end-to-end latency
- `serve` - Benchmark a running server
- `startup` - Measure engine startup time
Throughput benchmark
Measures the maximum throughput (tokens per second) that vLLM can achieve.

Examples
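A typical invocation might look like the following sketch; the model name is a placeholder, and the lengths should be adjusted to your hardware and workload:

```shell
# Benchmark offline throughput: 500 prompts of 128 input tokens,
# generating 256 output tokens each.
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 128 \
  --output-len 256 \
  --num-prompts 500
```

Results are reported as aggregate tokens/second and requests/second once all prompts complete.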
Common options
- `--model` - Model name or path to benchmark.
- `--input-len` - Input prompt length in tokens.
- `--output-len` - Output sequence length in tokens.
- `--num-prompts` - Number of prompts to benchmark.
- `--tensor-parallel-size` - Number of GPUs for tensor parallelism.
- `--dtype` - Data type: `auto`, `float16`, `bfloat16`, or `float32`.

Latency benchmark
Measures end-to-end latency for text generation.

Examples
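A minimal sketch of a latency run (the model name is a placeholder; batch size 1 isolates single-request latency):

```shell
# Measure end-to-end latency for a single request:
# 32 input tokens, 128 generated tokens.
vllm bench latency \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 32 \
  --output-len 128 \
  --batch-size 1
```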
Key metrics
The latency benchmark reports:

- First token latency: Time to generate the first token (TTFT)
- Inter-token latency: Time between subsequent tokens
- End-to-end latency: Total generation time
- P50, P90, P95, P99: Latency percentiles
Server benchmark
Benchmarks a running vLLM server via HTTP requests.

Examples
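A sketch of a server benchmark run, assuming a vLLM server is already listening on `localhost:8000` and serving the placeholder model shown:

```shell
# Send 200 requests at 4 requests/second against a running server.
vllm bench serve \
  --base-url http://localhost:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 4
```

Omitting `--request-rate` sends requests as fast as possible, which measures peak server throughput rather than behavior under a fixed load.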
Options

- `--base-url` - Base URL of the vLLM server.
- `--num-prompts` - Number of requests to send.
- `--request-rate` - Request rate (requests per second). If not specified, requests are sent as fast as possible.
- `--random-input-len` - Input prompt length.
- `--random-output-len` - Output sequence length.
Startup benchmark
Measures engine initialization and model loading time.

Example
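A minimal sketch, using the `startup` subcommand listed above with a placeholder model name:

```shell
# Time engine initialization and model loading.
vllm bench startup --model meta-llama/Llama-3.1-8B-Instruct
```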
The startup benchmark reports:

- Model loading time
- Engine initialization time
- Total startup time
Understanding results
Throughput metrics
- Throughput: Tokens generated per second across all requests
- Request throughput: Requests completed per second
Throughput generally improves with:

- More GPUs (via tensor/pipeline parallelism)
- Larger batch sizes
- Shorter sequences
- More efficient quantization
Latency metrics
- TTFT (Time to First Token): Critical for interactive use
- TPOT (Time Per Output Token): Affects streaming responsiveness
- Total latency: End-to-end generation time
Latency is affected by:

- Model size
- Input/output length
- Batch size (higher batch = higher latency)
- Number of GPUs
Example workflow
1. Baseline performance
2. Scale up with tensor parallelism
3. Test with different sequence lengths
4. Measure latency characteristics
5. Benchmark production server
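The workflow above can be sketched as a sequence of commands; `$MODEL` stands in for your model name or path, and flag values are illustrative:

```shell
MODEL=meta-llama/Llama-3.1-8B-Instruct  # placeholder

# 1. Baseline throughput on a single GPU
vllm bench throughput --model $MODEL --input-len 128 --output-len 128

# 2. Scale up with tensor parallelism
vllm bench throughput --model $MODEL --tensor-parallel-size 4

# 3. Test with different (longer) sequence lengths
vllm bench throughput --model $MODEL --input-len 1024 --output-len 1024

# 4. Measure latency characteristics
vllm bench latency --model $MODEL --batch-size 1 --output-len 128

# 5. Benchmark the production server
vllm bench serve --base-url http://localhost:8000 --model $MODEL
```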
Optimization tips
Based on benchmark results:

High throughput needed:

- Increase `--max-num-seqs`
- Enable `--enable-chunked-prefill`
- Use `--enable-prefix-caching` for repeated prefixes
- Increase `--gpu-memory-utilization`

Low latency needed:

- Reduce batch size (lower `--max-num-seqs`)
- Use tensor parallelism for large models
- Consider quantization (AWQ, GPTQ)
- Disable unnecessary features

Memory constrained:

- Reduce `--gpu-memory-utilization`
- Enable quantization
- Reduce `--max-model-len`
- Use CPU offloading if needed
Related
- `vllm serve` - Start API server
- `EngineArgs` - Configuration options