The vllm serve command launches an OpenAI-compatible API server. These arguments control server behavior, authentication, CORS, SSL, and other HTTP server settings.
Server arguments are separate from engine arguments. The server arguments configure the API server wrapper, while engine arguments configure the underlying inference engine.

Configuration methods

You can configure the server using:
  1. Command line arguments
  2. YAML configuration file

YAML configuration file

You can load arguments from a YAML config file:
# config.yaml
model: meta-llama/Llama-3.1-8B-Instruct
host: "127.0.0.1"
port: 8000
tensor-parallel-size: 2
uvicorn-log-level: "info"
enable-log-requests: true
Use the config file:
vllm serve --config config.yaml
Precedence order: command line > config file > defaults. If an argument appears in both the command line and the config file, the command-line value takes precedence.
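For example, a flag passed on the command line overrides the same key in the config file (a sketch, assuming the config.yaml above with port: 8000):
```shell
# The CLI flag wins: the server binds to port 9000, not the 8000 from config.yaml
vllm serve --config config.yaml --port 9000
```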

Server arguments

Basic server settings

host
str
default:"None"
Host IP address to bind the server to.
vllm serve MODEL --host 0.0.0.0
port
int
default:"8000"
Port number to run the server on.
vllm serve MODEL --port 8080
uds
str
default:"None"
Unix domain socket path. If set, host and port arguments are ignored.
vllm serve MODEL --uds /tmp/vllm.sock
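Clients then connect over the socket instead of TCP, for example with curl's --unix-socket option (a sketch, assuming the socket path above):
```shell
# The host part of the URL is ignored when connecting over a Unix domain socket
curl --unix-socket /tmp/vllm.sock http://localhost/v1/models
```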
uvicorn_log_level
str
default:"info"
Log level for the uvicorn server. Options: critical, error, warning, info, debug, trace
disable_uvicorn_access_log
bool
default:"false"
Disable uvicorn access logging.
disable_access_log_for_endpoints
str
default:"None"
Comma-separated list of endpoint paths to exclude from access logs. Useful for reducing log noise from health checks:
--disable-access-log-for-endpoints "/health,/metrics,/ping"

Authentication

api_key
list[str]
default:"None"
API keys required in the request header.
vllm serve MODEL --api-key sk-key1 --api-key sk-key2
Clients must include the key in requests:
curl -H "Authorization: Bearer sk-key1" http://localhost:8000/v1/completions

CORS configuration

allow_credentials
bool
default:"false"
Allow credentials in CORS requests.
allowed_origins
list[str]
default:"['*']"
Allowed origins for CORS.
--allowed-origins '["https://example.com", "https://app.example.com"]'
allowed_methods
list[str]
default:"['*']"
Allowed HTTP methods for CORS.
allowed_headers
list[str]
default:"['*']"
Allowed headers for CORS.
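The CORS options are typically combined when a browser front end calls the API from a different origin. A sketch (the origin shown is a placeholder):
```shell
vllm serve MODEL \
  --allowed-origins '["https://app.example.com"]' \
  --allow-credentials \
  --allowed-methods '["GET", "POST"]' \
  --allowed-headers '["Authorization", "Content-Type"]'
```
Note that the CORS specification forbids combining credentials with a wildcard origin, so list origins explicitly when enabling credentials.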

SSL/TLS configuration

ssl_keyfile
str
default:"None"
Path to SSL key file for HTTPS.
ssl_certfile
str
default:"None"
Path to SSL certificate file for HTTPS.
ssl_ca_certs
str
default:"None"
Path to CA certificates file.
enable_ssl_refresh
bool
default:"false"
Automatically refresh SSL context when certificate files change.
ssl_cert_reqs
int
default:"0"
Whether client certificate is required (see Python ssl module).
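The integer values map to the Python ssl module's constants: 0 = ssl.CERT_NONE, 1 = ssl.CERT_OPTIONAL, 2 = ssl.CERT_REQUIRED. A mutual-TLS sketch (certificate paths are placeholders):
```shell
# Require clients to present a certificate signed by the given CA
vllm serve MODEL \
  --ssl-keyfile /path/to/key.pem \
  --ssl-certfile /path/to/cert.pem \
  --ssl-ca-certs /path/to/ca.pem \
  --ssl-cert-reqs 2
```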

Chat template configuration

chat_template
str
default:"None"
Path to chat template file or template string.
vllm serve MODEL --chat-template /path/to/template.jinja
trust_request_chat_template
bool
default:"false"
Trust chat templates provided in requests.
Only enable if you trust all API clients, as templates can execute arbitrary code.
response_role
str
default:"assistant"
The role name to return when add_generation_prompt=true.

LoRA configuration

lora_modules
list[LoRAModulePath]
default:"None"
LoRA modules to load at startup. Old format:
--lora-modules name1=path1 name2=path2
New JSON format:
--lora-modules '{"name": "adapter1", "path": "/path/to/lora"}'
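Once an adapter is loaded, clients select it by passing its registered name as the model in a request (a sketch, assuming the adapter1 name above):
```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "adapter1", "prompt": "Hello", "max_tokens": 16}'
```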

Tool calling

enable_auto_tool_choice
bool
default:"false"
Enable automatic tool choice for supported models. Requires --tool-call-parser to be specified.
tool_call_parser
str
default:"None"
Tool call parser for the model. Built-in parsers: hermes, mistral, internlm, llama3_json
vllm serve MODEL --enable-auto-tool-choice --tool-call-parser hermes
tool_parser_plugin
str
default:""
Plugin for custom tool parser.
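With auto tool choice enabled, clients pass tool definitions in the standard OpenAI format; a minimal sketch (the get_weather function is a placeholder):
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Hermes-3-Llama-3.1-8B",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```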

Logging configuration

max_log_len
int
default:"None"
Maximum number of prompt characters or prompt ID numbers to print in logs. None means unlimited.
enable_log_outputs
bool
default:"false"
Log model outputs (generations). Requires --enable-log-requests.
log_error_stack
bool
default:"false"
Log stack trace of error responses.

Advanced server settings

disable_frontend_multiprocessing
bool
default:"false"
Run the API server in the same process as the model serving engine.
root_path
str
default:"None"
FastAPI root_path when app is behind a path-based routing proxy.
middleware
list[str]
default:"[]"
Additional ASGI middleware to apply.
--middleware my_package.middleware.MyMiddleware
enable_request_id_headers
bool
default:"false"
Add X-Request-Id header to responses.
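When enabled, each response carries an X-Request-Id header, which can be inspected with curl's -i flag (a sketch against a locally running server):
```shell
# -i prints response headers; look for X-Request-Id in the output
curl -i http://localhost:8000/health
```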
disable_fastapi_docs
bool
default:"false"
Disable FastAPI’s OpenAPI schema, Swagger UI, and ReDoc endpoints.
h11_max_incomplete_event_size
int
default:"4194304"
Maximum size (bytes) of an incomplete HTTP event for the h11 parser. Default: 4 MB. Helps mitigate header abuse.
h11_max_header_count
int
default:"256"
Maximum number of HTTP headers allowed. Helps mitigate header abuse.

Data parallel settings

headless
bool
default:"false"
Run in headless mode for multi-node data parallelism.
api_server_count
int
default:"None"
Number of API server processes to run. Defaults to data_parallel_size if not specified.
vllm serve MODEL --api-server-count 4

Usage examples

Basic server

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

Server with authentication

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --api-key sk-secret-key-123

HTTPS server

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --ssl-keyfile /path/to/key.pem \
  --ssl-certfile /path/to/cert.pem

Server with custom chat template

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --chat-template /path/to/template.jinja

Server with tool calling

vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Multi-API server deployment

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --api-server-count 4 \
  --data-parallel-size 2
