The vLLM CLI provides commands for serving models, running benchmarks, and managing your deployment. All commands follow the pattern vllm [subcommand] [options].

Available commands

View all available commands:
vllm --help
Command           Description
vllm serve        Launch OpenAI-compatible API server
vllm bench        Run performance benchmarks
vllm collect-env  Collect environment information for debugging
vllm run-batch    Run offline batch inference

vllm serve

Launch an OpenAI-compatible HTTP API server to serve LLM completions.

Basic usage

vllm serve [model_tag] [options]
Example:
vllm serve meta-llama/Llama-3.2-1B-Instruct

Model configuration

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --trust-remote-code

Data parallel deployment

1. Single node, multiple GPUs

Launch with data parallelism on a single node:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --tensor-parallel-size 2
This uses 8 GPUs total (4 DP ranks × 2 TP size).
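The GPU requirement can be sanity-checked before launching: total GPUs needed equals data-parallel size times tensor-parallel size.

```shell
# Sanity-check GPU requirements before launching:
# total GPUs needed = data-parallel size * tensor-parallel size.
DP=4
TP=2
echo "GPUs required: $((DP * TP))"
# prints "GPUs required: 8"
```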

2. Multi-node with internal load balancing

Run on multiple nodes with a single API endpoint:
# Node 0 (head node with IP 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Node 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

3. Multi-node with external load balancing

Run each DP rank as a separate server:
# Rank 0 (IP: 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 0 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Rank 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 1 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345
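
The two per-rank commands differ only in the rank (and the listening port, when ranks are co-located), so they can be generated from a loop. A sketch that echoes the commands rather than executing them, assuming rank N serves on port 8000 + N — the port-per-rank scheme is an illustration, not something the flags above mandate:

```shell
# Generate the per-rank launch commands (echoed, not executed, so this
# runs without GPUs). Assumes rank N listens on port 8000+N.
for RANK in 0 1; do
  echo "vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --port $((8000 + RANK)) \
    --data-parallel-size 2 \
    --data-parallel-rank ${RANK} \
    --data-parallel-address 10.99.48.128 \
    --data-parallel-rpc-port 13345"
done
```

An external load balancer would then spread requests across ports 8000 and 8001.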

Performance options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95
Key parameters:
  • --max-num-batched-tokens - Maximum tokens processed in a single batch
  • --max-num-seqs - Maximum number of sequences in a batch
  • --enable-prefix-caching - Enable KV cache reuse for repeated prompts
  • --enable-chunked-prefill - Split large prompts into chunks
  • --gpu-memory-utilization - Fraction of GPU memory to use (0.0-1.0)

Chat templates

Specify a custom chat template:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template ./templates/custom_chat.jinja
Override content format detection:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template-content-format openai
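
For illustration, a template file can be created like this. The template body is a toy format, not any model's official template — real chat models usually ship their template with the tokenizer config:

```shell
# Write an illustrative (toy) chat template; real models typically ship
# their own template alongside the tokenizer.
mkdir -p templates
cat > templates/custom_chat.jinja <<'EOF'
{%- for message in messages -%}
{{ message['role'] }}: {{ message['content'] }}
{% endfor -%}
assistant:
EOF
echo "wrote templates/custom_chat.jinja"
```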

Server options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key token-abc123 \
  --enable-request-id-headers \
  --enable-offline-docs \
  --uvicorn-log-level info
Server parameters:
  • --host - Host IP address (default: None)
  • --port - Port number (default: 8000)
  • --api-key - API key for authentication
  • --enable-request-id-headers - Enable X-Request-ID header tracking
  • --enable-offline-docs - Enable offline API documentation
  • --uvicorn-log-level - Logging level: critical, error, warning, info, debug, trace
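
Once the server is up, the OpenAI-compatible endpoint can be exercised with curl. A sketch assuming the --api-key value token-abc123 from above and a server listening on localhost:8000:

```shell
# Build a chat-completion request payload.
cat > request.json <<'EOF'
{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 32
}
EOF

# Validate the payload locally before sending (python3 assumed available):
python3 -m json.tool request.json > /dev/null && echo "payload ok"

# Against a running server, send it with the matching API key:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer token-abc123" \
#     -d @request.json
```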

Advanced deployment

# Headless mode for multi-node deployments: run the engine without an API server
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4

Environment variables

Common environment variables for vllm serve:
# CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3

# vLLM settings
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=25

vllm serve meta-llama/Llama-3.2-1B-Instruct

vllm bench

Run performance benchmarks to measure throughput and latency.

Benchmark types

vllm bench provides latency, throughput, and serve subcommands. A throughput run:
vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sonnet \
  --num-prompts 1000

Benchmark options

vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --max-num-batched-tokens 8192
Common parameters:
  • --model - Model to benchmark
  • --dataset-name - Dataset to use (sonnet, sharegpt, random)
  • --num-prompts - Number of prompts to process
  • --input-len - Input sequence length
  • --output-len - Output sequence length
  • --request-rate - Requests per second (for serving benchmarks)
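
Serving benchmarks are often swept over several request rates. A sketch that echoes the commands (so it runs without a GPU), assuming vllm bench serve accepts --request-rate as listed above:

```shell
# Sweep request rates for a serving benchmark (commands echoed only).
for RATE in 1 5 10; do
  echo "vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --request-rate ${RATE} \
    --num-prompts 200"
done
```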

vllm collect-env

Collect environment information for debugging and issue reporting:
vllm collect-env
This outputs:
  • Python version
  • PyTorch version
  • CUDA version
  • vLLM version
  • GPU information
  • System details
Example output:
vLLM Version: 0.6.0
Python Version: 3.10.12
PyTorch Version: 2.4.0+cu121
CUDA Version: 12.1
GPU: NVIDIA A100-SXM4-80GB

vllm run-batch

Run offline batch inference from command line:
vllm run-batch \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl
Input file format (JSONL, OpenAI batch request format):
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello, my name is"}], "max_tokens": 50}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "The capital of France is"}], "max_tokens": 50}}
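
The input file can be generated with a heredoc. vllm run-batch consumes OpenAI-style batch request lines, so each line carries a custom_id, method, url, and body (a sketch):

```shell
# Generate a two-request batch input file (OpenAI batch request format).
cat > prompts.jsonl <<'EOF'
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello, my name is"}], "max_tokens": 50}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "The capital of France is"}], "max_tokens": 50}}
EOF
echo "lines: $(wc -l < prompts.jsonl)"
```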

Contextual help

Get help for specific command groups:
# View model configuration options
vllm serve --help=ModelConfig

# View frontend server options
vllm serve --help=Frontend

# View all options at once
vllm serve --help=all

Version information

Check vLLM version:
vllm --version

Common workflows

1. Development server

Quick local server for testing:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --port 8000 \
  --max-model-len 2048

2. Production deployment

Production-ready configuration:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key $API_KEY \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --disable-log-stats

3. Benchmark comparison

Compare different configurations:
# Baseline
vllm bench throughput --model meta-llama/Llama-3.2-1B-Instruct

# With prefix caching
vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --enable-prefix-caching
For complete parameter documentation, visit the configuration reference.

Troubleshooting

If you encounter CUDA out-of-memory errors, try:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 2048
Common issues:
  1. Port already in use: Change the port with --port 8080
  2. Model not found: Ensure HuggingFace credentials are set: huggingface-cli login
  3. GPU memory issues: Reduce --gpu-memory-utilization or --max-model-len
  4. Slow startup: Add --enforce-eager to skip CUDA graph compilation
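
For the port-in-use case, the port can be checked before launching. A sketch using ss (assumed available; lsof -i :8000 is an alternative) that falls back to 8080 when 8000 is taken:

```shell
# Fall back to port 8080 if 8000 is already bound.
PORT=8000
if ss -ltn 2>/dev/null | grep -q ":${PORT} "; then
  echo "port ${PORT} in use, switching"
  PORT=8080
fi
echo "serving on port ${PORT}"
# vllm serve meta-llama/Llama-3.2-1B-Instruct --port ${PORT}
```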
