The vllm serve command launches a local OpenAI-compatible API server that serves LLM completions and chat over HTTP.

Basic usage

vllm serve MODEL_NAME [OPTIONS]

Examples

Serve a model with defaults

vllm serve facebook/opt-125m
This starts a server on http://localhost:8000 with default settings.

Specify tensor parallelism

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4
Splits each layer's weights across 4 GPUs using tensor parallelism.

Configure memory and context length

vllm serve mistralai/Mistral-7B-v0.1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192

Enable prefix caching

vllm serve meta-llama/Llama-2-7b-hf \
  --enable-prefix-caching
Caches KV blocks for shared prompt prefixes (such as a common system prompt) so repeated prefixes are not recomputed.

Serve with quantization

vllm serve TheBloke/Llama-2-13B-AWQ \
  --quantization awq

Custom host and port

vllm serve facebook/opt-125m \
  --host 0.0.0.0 \
  --port 8080

Common options

Model configuration

model
string
required
The model name or path. First positional argument.
--tokenizer
string
Tokenizer name or path. Defaults to model path.
--revision
string
Model revision (branch, tag, or commit).
--dtype
string
default:"auto"
Data type: auto, float16, bfloat16, or float32.
--quantization
string
Quantization method: awq, gptq, or fp8.
--max-model-len
integer
Maximum context length.

Server configuration

--host
string
default:"0.0.0.0"
Host IP address to bind the server.
--port
integer
default:"8000"
Port number for the server.
--ssl-keyfile
string
Path to SSL key file for HTTPS.
--ssl-certfile
string
Path to SSL certificate file for HTTPS.
--api-key
string
API key for authentication. Can also be set via the VLLM_API_KEY environment variable.

Parallelism

--tensor-parallel-size
integer
default:"1"
Number of GPUs for tensor parallelism.
--pipeline-parallel-size
integer
default:"1"
Number of pipeline stages.
--data-parallel-size
integer
default:"1"
Number of data parallel replicas.

Memory management

--gpu-memory-utilization
float
default:"0.9"
Fraction of GPU memory to use (0.0-1.0).
--swap-space
float
default:"4"
CPU swap space in GiB.
--enable-prefix-caching
boolean
Enable prefix caching for common prompts.
--block-size
integer
default:"16"
Token block size for paged attention.
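As a rough illustration of how these settings interact (the arithmetic below is an added sketch, not from the original reference): paged attention allocates the KV cache in fixed-size token blocks, so the block count per sequence follows directly from the context length.

```python
import math

# Paged attention stores the KV cache in fixed-size token blocks.
block_size = 16        # --block-size default
max_model_len = 8192   # --max-model-len value from the earlier example

# Number of KV-cache blocks one full-length sequence occupies.
blocks_per_seq = math.ceil(max_model_len / block_size)
print(blocks_per_seq)  # 512
```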

Performance

--enforce-eager
boolean
Disable CUDA graphs (use eager mode).
--max-num-seqs
integer
Maximum sequences to process in a batch.
--max-num-batched-tokens
integer
Maximum tokens to batch together.
--enable-chunked-prefill
boolean
Process long prompts in chunks.

Multi-modal

--limit-mm-per-prompt
object
Maximum number of multi-modal inputs per prompt, given as a JSON object.
--mm-processor-kwargs
object
Additional kwargs for multi-modal processor.

Advanced

--trust-remote-code
boolean
Trust and execute remote code from the Hugging Face Hub (required by some model architectures).
--enable-lora
boolean
Enable LoRA adapter support.
--max-loras
integer
default:"1"
Maximum number of LoRA adapters.
--disable-log-stats
boolean
Disable logging statistics.

Advanced examples

Multi-GPU serving with LoRA

vllm serve meta-llama/Llama-2-13b-hf \
  --tensor-parallel-size 2 \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64

High-throughput configuration

vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill

Vision model serving

vllm serve llava-hf/llava-1.5-7b-hf \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 5}'
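Clients attach images to requests for such a server as OpenAI-style content parts. A minimal sketch of one such message (the image URL below is a placeholder, not from the original):

```python
# OpenAI-style multi-modal chat message; the URL is a placeholder.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://example.com/photo.jpg"},
        },
    ],
}

# --limit-mm-per-prompt '{"image": 5}' caps image parts per prompt.
image_parts = [p for p in message["content"] if p["type"] == "image_url"]
assert len(image_parts) <= 5
```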

Secure server with authentication

vllm serve facebook/opt-125m \
  --api-key your-secret-key \
  --ssl-keyfile /path/to/key.pem \
  --ssl-certfile /path/to/cert.pem
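When the server is started with --api-key, clients must send the key as a Bearer token. A minimal sketch of the required header, reusing the placeholder key from the example above:

```python
# Authorization header a client must attach when --api-key is set.
api_key = "your-secret-key"  # placeholder from the example above
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
```

The OpenAI Python client adds this header automatically when the key is passed as api_key.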

Environment variables

VLLM_API_KEY
string
API key for authentication (alternative to --api-key).
VLLM_HOST
string
Server host (alternative to --host).
VLLM_PORT
integer
Server port (alternative to --port).
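For example, the API key can come from the environment instead of the command line (placeholder key, as in the earlier example):

```shell
export VLLM_API_KEY=your-secret-key
vllm serve facebook/opt-125m
```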

Accessing the server

Once running, the server provides OpenAI-compatible endpoints:
  • GET /health - Health check (returns 200 when the server is ready)
  • GET /v1/models - List models
  • POST /v1/completions - Text completions
  • POST /v1/chat/completions - Chat completions
  • POST /v1/embeddings - Generate embeddings
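These endpoints accept standard OpenAI-style request bodies. As a sketch, a minimal /v1/completions payload (the same fields as the curl example below) can be built like this:

```python
import json

# Minimal /v1/completions request body; any OpenAI-compatible
# client serializes this same shape.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
}
body = json.dumps(payload)
```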

Test with curl

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50
  }'

Use with OpenAI Python client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # any placeholder works when the server has no --api-key
)

completion = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
