```shell
vllm [subcommand] [options]
```
## Available commands

View all available commands:

| Command | Description |
|---|---|
| `vllm serve` | Launch OpenAI-compatible API server |
| `vllm bench` | Run performance benchmarks |
| `vllm collect-env` | Collect environment information for debugging |
| `vllm run-batch` | Run offline batch inference |
## vllm serve
Launch an OpenAI-compatible HTTP API server to serve LLM completions.

### Basic usage
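A minimal launch, sketched with an example model identifier from the Hugging Face Hub (substitute your own):

```shell
# Start an OpenAI-compatible server on the default port (8000)
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Once running, the server exposes OpenAI-style endpoints such as `/v1/completions` and `/v1/chat/completions`.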
### Model configuration

- Model loading
- Quantization
- Tensor parallelism
- Pipeline parallelism
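A sketch combining several of these areas; the model name and flag values are illustrative only:

```shell
# Load an AWQ-quantized model and shard it across 4 GPUs with tensor parallelism
vllm serve TheBloke/Llama-2-13B-AWQ \
  --quantization awq \
  --tensor-parallel-size 4 \
  --max-model-len 4096
```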
### Data parallel deployment

#### Single node, multiple GPUs

Launch with data parallelism on a single node. Combining 4 data-parallel ranks with a tensor-parallel size of 2 uses 8 GPUs in total.
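A sketch of that 8-GPU layout (4 DP ranks × TP 2); the flags assume a recent vLLM release:

```shell
# 4 data-parallel ranks, each a tensor-parallel group of 2 GPUs = 8 GPUs total
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 4 \
  --tensor-parallel-size 2
```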
### Performance options

- `--max-num-batched-tokens` - Maximum tokens processed in a single batch
- `--max-num-seqs` - Maximum number of sequences in a batch
- `--enable-prefix-caching` - Enable KV cache reuse for repeated prompts
- `--enable-chunked-prefill` - Split large prompts into chunks
- `--gpu-memory-utilization` - Fraction of GPU memory to use (0.0-1.0)
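These flags compose; a tuning sketch (values are illustrative, not recommendations):

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 128 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9
```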
### Chat templates
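Chat endpoints render messages through a Jinja chat template; a custom one can be supplied at launch (the template path here is a placeholder):

```shell
# Override the tokenizer's built-in template with a local Jinja file
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --chat-template ./my_template.jinja
```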
Specify a custom chat template for models whose tokenizer does not bundle one.

### Server options
- `--host` - Host IP address (default: `None`)
- `--port` - Port number (default: `8000`)
- `--api-key` - API key for authentication
- `--enable-request-id-headers` - Enable `X-Request-ID` header tracking
- `--enable-offline-docs` - Enable offline API documentation
- `--uvicorn-log-level` - Logging level: `critical`, `error`, `warning`, `info`, `debug`, `trace`
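A sketch combining the networking and authentication flags above (the key value is a placeholder):

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key my-secret-key \
  --uvicorn-log-level warning
```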
### Advanced deployment
### Environment variables

Common environment variables for `vllm serve`:
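A sketch using variable names commonly seen in vLLM deployments; confirm the exact names against the documentation for the installed version, as they vary across releases:

```shell
# Pin the server port and API key via the environment instead of flags
export VLLM_PORT=8000
export VLLM_API_KEY=my-secret-key
# Raise log verbosity for debugging
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve meta-llama/Llama-3.1-8B-Instruct
```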
## vllm bench
Run performance benchmarks to measure throughput and latency.

### Benchmark types
- Throughput
- Latency
- Serving
### Benchmark options
- `--model` - Model to benchmark
- `--dataset-name` - Dataset to use (`sonnet`, `sharegpt`, `random`)
- `--num-prompts` - Number of prompts to process
- `--input-len` - Input sequence length
- `--output-len` - Output sequence length
- `--request-rate` - Requests per second (for serving benchmarks)
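A throughput-benchmark sketch using the options above; the model and values are illustrative:

```shell
# Measure offline throughput on 100 random prompts, 128 tokens in and out
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --num-prompts 100 \
  --input-len 128 \
  --output-len 128
```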
## vllm collect-env
Collect environment information for debugging and issue reporting:

- Python version
- PyTorch version
- CUDA version
- vLLM version
- GPU information
- System details
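The command takes no required arguments, and its report can be pasted directly into a bug report:

```shell
# Print environment details (Python, PyTorch, CUDA, vLLM, GPU, OS)
vllm collect-env
```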
## vllm run-batch
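`vllm run-batch` processes a JSONL file of OpenAI-style batch requests and writes results to an output file; a hedged sketch (file names are placeholders):

```shell
# Each input line is a batch request targeting e.g. /v1/chat/completions
vllm run-batch \
  -i requests.jsonl \
  -o results.jsonl \
  --model meta-llama/Llama-3.1-8B-Instruct
```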
Run offline batch inference from the command line.

## Contextual help
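Each subcommand accepts `--help`, which lists its full flag set:

```shell
vllm --help          # top-level commands
vllm serve --help    # all server flags
vllm bench --help    # available benchmark types
```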
Use `--help` with any command group for detailed options.

## Version information
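The installed version can be printed directly (flag spelling may vary by release):

```shell
vllm --version
# Equivalent check from Python
python -c "import vllm; print(vllm.__version__)"
```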
Check the installed vLLM version.

## Common workflows
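A typical end-to-end workflow, sketched with a placeholder model: launch the server, then query it via the OpenAI-compatible REST API:

```shell
# Terminal 1: start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Terminal 2: send a chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```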
For complete parameter documentation, visit the configuration reference.
## Troubleshooting

Common issues:

- Port already in use: Change the port with `--port 8080`
- Model not found: Ensure Hugging Face credentials are set: `huggingface-cli login`
- GPU memory issues: Reduce `--gpu-memory-utilization` or `--max-model-len`
- Slow startup: Add `--enforce-eager` to skip CUDA graph compilation
## Examples

Explore full CLI examples:

- Server configuration - vllm/entrypoints/cli/serve.py:33
- Multi-node setup script
- Data parallel deployment