The `vllm serve` command launches an OpenAI-compatible API server. The arguments described here control server behavior, authentication, CORS, SSL, and other HTTP server settings.
Server arguments are separate from engine arguments. The server arguments configure the API server wrapper, while engine arguments configure the underlying inference engine.
## Configuration methods
You can configure the server using:

- Command-line arguments
- A YAML configuration file
### YAML configuration file
You can load arguments from a YAML config file with `--config`. The precedence order is: command line > config file > defaults. If an argument appears in both the command line and the config file, the command-line value takes precedence.
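As a sketch of the precedence rule, the config file keys mirror the CLI flag names (the model name and option values below are illustrative):

```shell
# Write an illustrative config file; keys mirror the CLI flag names.
cat > config.yaml <<'EOF'
host: 0.0.0.0
port: 8000
uvicorn-log-level: debug
EOF

# --port 9000 on the command line overrides port: 8000 from the config file.
vllm serve meta-llama/Llama-3.1-8B-Instruct --config config.yaml --port 9000
```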
## Server arguments
### Basic server settings
- `--host`: Host IP address to bind the server to.
- `--port`: Port number to run the server on.
- `--uds`: Unix domain socket path. If set, `--host` and `--port` are ignored.
- `--uvicorn-log-level`: Log level for the uvicorn server. Options: `critical`, `error`, `warning`, `info`, `debug`, `trace`.
- `--disable-uvicorn-access-log`: Disable uvicorn access logging.
- Comma-separated list of endpoint paths to exclude from access logs, useful for reducing log noise from health checks.
### Authentication
- `--api-key`: API key required in the request header. When set, clients must include the key in every request.
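Assuming a server started with `--api-key my-secret-token` (the token value is illustrative), clients pass the key as a Bearer token:

```shell
# The Authorization header must carry the same key the server was started with.
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer my-secret-token"
```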
### CORS configuration
- `--allow-credentials`: Allow credentials in CORS requests.
- `--allowed-origins`: Allowed origins for CORS.
- `--allowed-methods`: Allowed HTTP methods for CORS.
- `--allowed-headers`: Allowed headers for CORS.
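As a sketch, the list-valued CORS flags take JSON arrays (the origin shown is illustrative):

```shell
# Restrict CORS to a single origin with credentials enabled.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --allow-credentials \
  --allowed-origins '["https://example.com"]' \
  --allowed-methods '["GET", "POST"]' \
  --allowed-headers '["*"]'
```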
### SSL/TLS configuration
- `--ssl-keyfile`: Path to the SSL key file for HTTPS.
- `--ssl-certfile`: Path to the SSL certificate file for HTTPS.
- `--ssl-ca-certs`: Path to a CA certificates file.
- `--enable-ssl-refresh`: Automatically refresh the SSL context when certificate files change.
- `--ssl-cert-reqs`: Whether a client certificate is required (see the Python `ssl` module's `CERT_*` constants).
### Chat template configuration
- `--chat-template`: Path to a chat template file, or the template itself as an inline string.
- Trust chat templates provided in requests.
- `--response-role`: The role name to return when `add_generation_prompt=true`.

### LoRA configuration
- `--lora-modules`: LoRA modules to load at startup. Accepts both an older `name=path` format and a newer JSON format.
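A sketch of both formats (the adapter name and paths are illustrative):

```shell
# Old key=value format
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --lora-modules my-lora=/path/to/lora

# New JSON format, which also allows setting base_model_name
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --lora-modules '{"name": "my-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-3.1-8B-Instruct"}'
```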
### Tool calling
- `--enable-auto-tool-choice`: Enable automatic tool choice for supported models. Requires `--tool-call-parser` to be specified.
- `--tool-call-parser`: Tool call parser for the model. Built-in parsers include `hermes`, `mistral`, `internlm`, and `llama3_json`.
- `--tool-parser-plugin`: Plugin for registering a custom tool call parser.
### Logging configuration
- `--max-log-len`: Maximum number of prompt characters or prompt ID numbers to print in logs. `None` means unlimited.
- Log model outputs (generations). Requires `--enable-log-requests`.
- Log the stack trace of error responses.
### Advanced server settings
- Run the API server in the same process as the model serving engine.
- `--root-path`: FastAPI `root_path` when the app is behind a path-based routing proxy.
- `--middleware`: Additional ASGI middleware to apply to the app.
- `--enable-request-id-headers`: Add an `X-Request-Id` header to responses.
- `--disable-fastapi-docs`: Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoints.
- `--h11-max-incomplete-event-size`: Maximum size (in bytes) of an incomplete HTTP event for the h11 parser. Default: 4 MB. Helps mitigate header abuse.
- `--h11-max-header-count`: Maximum number of HTTP headers allowed. Helps mitigate header abuse.
### Data parallel settings
- `--headless`: Run in headless mode for multi-node data parallelism.
- `--api-server-count`: Number of API server processes to run. Defaults to `data_parallel_size` if not specified.

## Usage examples
### Basic server
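A minimal launch (the model name is illustrative):

```shell
# Bind to all interfaces on the default-style port.
vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000
```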
### Server with authentication
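A sketch with an API key (the token value is illustrative; in practice, load it from a secret store rather than the command line):

```shell
# Clients must now send "Authorization: Bearer my-secret-token".
vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key my-secret-token
```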
### HTTPS server
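A sketch of an HTTPS launch (the certificate paths are illustrative):

```shell
# Serve over TLS using a key/certificate pair.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --ssl-keyfile /path/to/key.pem \
  --ssl-certfile /path/to/cert.pem
```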
### Server with custom chat template
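A sketch using a template file (the file path is illustrative):

```shell
# Override the model's built-in chat template with a local Jinja file.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --chat-template ./my_template.jinja
```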
### Server with tool calling
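A sketch using the `hermes` parser with a Hermes-family model (the model name is illustrative):

```shell
# --enable-auto-tool-choice requires a matching --tool-call-parser.
vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```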
### Multi-API server deployment
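A sketch combining data parallelism with multiple API server processes (the sizes are illustrative; `--data-parallel-size` is an engine argument):

```shell
# Run two data-parallel engine replicas behind two API server processes.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --data-parallel-size 2 \
  --api-server-count 2
```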
## See also
- Engine arguments - Configuration for the inference engine
- Environment variables - Runtime environment configuration
- Optimization guide - Performance tuning strategies