The vllm serve command launches a local OpenAI-compatible API server that serves LLM completions over HTTP.
Basic usage
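The general form takes the model as the first positional argument, followed by options:

```shell
vllm serve <model> [options]
```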
Examples
Serve a model with defaults
Serves the model at http://localhost:8000 with default settings.
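A minimal sketch (the model ID here is illustrative; any Hugging Face model ID or local path works):

```shell
# Download the model if needed and serve it with all defaults
vllm serve meta-llama/Llama-3.1-8B-Instruct
```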
Specify tensor parallelism
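For example, to shard the model across multiple GPUs (model ID illustrative):

```shell
# Split model weights across 4 GPUs with tensor parallelism
vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 4
```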
Configure memory and context length
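A sketch combining the memory fraction and context-length options (values illustrative):

```shell
# Cap GPU memory usage at 80% and limit the context window to 8192 tokens
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 8192
```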
Enable prefix caching
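For example:

```shell
# Reuse cached KV blocks for shared prompt prefixes (e.g. a common system prompt)
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching
```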
Serve with quantization
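A sketch serving a pre-quantized checkpoint (model ID illustrative):

```shell
# Serve an AWQ-quantized model
vllm serve TheBloke/Llama-2-7B-Chat-AWQ --quantization awq
```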
Custom host and port
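For example:

```shell
# Listen on all interfaces on port 8080 instead of the defaults
vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8080
```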
Common options
Model configuration
The model name or path. First positional argument.
Tokenizer name or path. Defaults to model path.
Model revision (branch, tag, or commit).
Data type: auto, float16, bfloat16, or float32.
Quantization method: awq, gptq, or fp8.
Maximum context length.
Server configuration
Host IP address to bind the server.
Port number for the server.
Path to SSL key file for HTTPS.
Path to SSL certificate file for HTTPS.
API key for authentication. Can also be set via the VLLM_API_KEY environment variable.
Parallelism
Number of GPUs for tensor parallelism.
Number of pipeline stages.
Number of data parallel replicas.
Memory management
Fraction of GPU memory to use (0.0-1.0).
CPU swap space in GiB.
Enable prefix caching for common prompts.
Token block size for paged attention.
Performance
Disable CUDA graphs (use eager mode).
Maximum sequences to process in a batch.
Maximum tokens to batch together.
Process long prompts in chunks.
Multi-modal
Maximum multi-modal inputs per prompt.
Additional kwargs for multi-modal processor.
Advanced
Trust remote code from HuggingFace.
Enable LoRA adapter support.
Maximum number of LoRA adapters.
Disable logging statistics.
Advanced examples
Multi-GPU serving with LoRA
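A sketch combining tensor parallelism with LoRA support (the adapter name and path are illustrative):

```shell
# Shard across 2 GPUs and register a LoRA adapter under the name "my-adapter"
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --enable-lora \
  --lora-modules my-adapter=/path/to/adapter
```

Requests can then target either the base model or the adapter by name in the model field.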
High-throughput configuration
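A sketch that trades per-request latency for aggregate throughput (batch limits are illustrative and should be tuned to the hardware):

```shell
# Allow larger batches and process long prompts in chunks
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill
```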
Vision model serving
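A sketch for a vision-language model (model ID illustrative; the exact --limit-mm-per-prompt value syntax may vary between vLLM versions):

```shell
# Serve a vision-language model, allowing up to 2 images per prompt
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=2
```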
Secure server with authentication
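A sketch requiring an API key and serving over HTTPS (key, certificate paths, and the key value are illustrative):

```shell
# Clients must send "Authorization: Bearer my-secret-key"
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --api-key my-secret-key \
  --ssl-keyfile /path/to/key.pem \
  --ssl-certfile /path/to/cert.pem
```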
Environment variables
API key for authentication (alternative to --api-key).
Server host (alternative to --host).
Server port (alternative to --port).
Accessing the server
Once running, the server provides OpenAI-compatible endpoints:
- GET /v1/models - List models
- POST /v1/completions - Text completions
- POST /v1/chat/completions - Chat completions
- POST /v1/embeddings - Generate embeddings
Test with curl
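For example, against a server started with defaults (the model name must match what was served):

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 32
  }'
```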
Use with OpenAI Python client
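Because the server is OpenAI-compatible, the official openai Python package works by overriding the base URL. A sketch assuming a server on the default host and port:

```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # any placeholder works unless --api-key was set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```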
Related
- Completions API - Text completion endpoint
- Chat API - Chat completion endpoint
- EngineArgs - All configuration options