The vllm serve command launches a local OpenAI-compatible API server that serves LLM completions and chat over HTTP.

Basic usage

vllm serve MODEL_NAME [OPTIONS]

Examples

Serve a model with defaults

vllm serve facebook/opt-125m
This starts a server on http://localhost:8000 with default settings.

Specify tensor parallelism

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4
Splits each layer's weights across 4 GPUs using tensor parallelism.

Configure memory and context length

vllm serve mistralai/Mistral-7B-v0.1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192

Enable prefix caching

vllm serve meta-llama/Llama-2-7b-hf \
  --enable-prefix-caching
Caches KV blocks for shared prompt prefixes (such as a common system prompt) so repeated prefixes are not recomputed.

Serve with quantization

vllm serve TheBloke/Llama-2-13B-AWQ \
  --quantization awq

Custom host and port

vllm serve facebook/opt-125m \
  --host 0.0.0.0 \
  --port 8080

Common options

Model configuration

model
string
required
The model name or path. First positional argument.
--tokenizer
string
Tokenizer name or path. Defaults to model path.
--revision
string
Model revision (branch, tag, or commit).
--dtype
string
default:"auto"
Data type: auto, float16, bfloat16, or float32.
--quantization
string
Quantization method: awq, gptq, or fp8.
--max-model-len
integer
Maximum context length.

Server configuration

--host
string
default:"0.0.0.0"
Host IP address to bind the server.
--port
integer
default:"8000"
Port number for the server.
--ssl-keyfile
string
Path to SSL key file for HTTPS.
--ssl-certfile
string
Path to SSL certificate file for HTTPS.
--api-key
string
API key for authentication. Can also be set via the VLLM_API_KEY environment variable.

Parallelism

--tensor-parallel-size
integer
default:"1"
Number of GPUs for tensor parallelism.
--pipeline-parallel-size
integer
default:"1"
Number of pipeline stages.
--data-parallel-size
integer
default:"1"
Number of data parallel replicas.

Memory management

--gpu-memory-utilization
float
default:"0.9"
Fraction of GPU memory to use (0.0-1.0).
--swap-space
float
default:"4"
CPU swap space in GiB.
--enable-prefix-caching
boolean
Enable prefix caching for common prompts.
--block-size
integer
default:"16"
Token block size for paged attention.
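As a rough illustration of how these settings interact (the arithmetic below is an added sketch, not from the original reference): paged attention allocates the KV cache in fixed-size token blocks, so the block count per sequence follows directly from the context length.

```python
import math

# Paged attention stores the KV cache in fixed-size token blocks.
block_size = 16        # --block-size default
max_model_len = 8192   # --max-model-len value from the earlier example

# Number of KV-cache blocks one full-length sequence occupies.
blocks_per_seq = math.ceil(max_model_len / block_size)
print(blocks_per_seq)  # 512
```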

Performance

--enforce-eager
boolean
Disable CUDA graphs (use eager mode).
--max-num-seqs
integer
Maximum sequences to process in a batch.
--max-num-batched-tokens
integer
Maximum tokens to batch together.
--enable-chunked-prefill
boolean
Process long prompts in chunks.

Multi-modal

--limit-mm-per-prompt
object
Maximum number of multi-modal inputs per prompt, given as a JSON object.
--mm-processor-kwargs
object
Additional kwargs for multi-modal processor.

Advanced

--trust-remote-code
boolean
Trust and execute remote code from the Hugging Face Hub (required by some model architectures).
--enable-lora
boolean
Enable LoRA adapter support.
--max-loras
integer
default:"1"
Maximum number of LoRA adapters.
--disable-log-stats
boolean
Disable logging statistics.

Advanced examples

Multi-GPU serving with LoRA

vllm serve meta-llama/Llama-2-13b-hf \
  --tensor-parallel-size 2 \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64

High-throughput configuration

vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill

Vision model serving

vllm serve llava-hf/llava-1.5-7b-hf \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 5}'
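Clients attach images to requests for such a server as OpenAI-style content parts. A minimal sketch of one such message (the image URL below is a placeholder, not from the original):

```python
# OpenAI-style multi-modal chat message; the URL is a placeholder.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://example.com/photo.jpg"},
        },
    ],
}

# --limit-mm-per-prompt '{"image": 5}' caps image parts per prompt.
image_parts = [p for p in message["content"] if p["type"] == "image_url"]
assert len(image_parts) <= 5
```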

Secure server with authentication

vllm serve facebook/opt-125m \
  --api-key your-secret-key \
  --ssl-keyfile /path/to/key.pem \
  --ssl-certfile /path/to/cert.pem
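When the server is started with --api-key, clients must send the key as a Bearer token. A minimal sketch of the required header, reusing the placeholder key from the example above:

```python
# Authorization header a client must attach when --api-key is set.
api_key = "your-secret-key"  # placeholder from the example above
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
```

The OpenAI Python client adds this header automatically when the key is passed as api_key.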

Environment variables

VLLM_API_KEY
string
API key for authentication (alternative to --api-key).
VLLM_HOST
string
Server host (alternative to --host).
VLLM_PORT
integer
Server port (alternative to --port).
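For example, the API key can come from the environment instead of the command line (placeholder key, as in the earlier example):

```shell
export VLLM_API_KEY=your-secret-key
vllm serve facebook/opt-125m
```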

Accessing the server

Once running, the server provides OpenAI-compatible endpoints:
  • GET /health - Health check (returns 200 when the server is ready)
  • GET /v1/models - List models
  • POST /v1/completions - Text completions
  • POST /v1/chat/completions - Chat completions
  • POST /v1/embeddings - Generate embeddings
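These endpoints accept standard OpenAI-style request bodies. As a sketch, a minimal /v1/completions payload (the same fields as the curl example below) can be built like this:

```python
import json

# Minimal /v1/completions request body; any OpenAI-compatible
# client serializes this same shape.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
}
body = json.dumps(payload)
```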

Test with curl

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50
  }'

Use with OpenAI Python client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # any placeholder works when the server has no --api-key
)

completion = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
