vLLM provides an HTTP server that implements OpenAI’s API specifications, allowing you to serve models and interact with them using standard OpenAI clients and tools.

Quick start

Start the server with the vllm serve command:
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123
The server starts on http://localhost:8000 by default. You can now send requests using the OpenAI Python client:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)
By default, the server applies generation_config.json from the HuggingFace model repository if it exists. To disable this behavior, pass --generation-config vllm when launching the server.

Supported APIs

vLLM implements the following OpenAI-compatible endpoints:

Standard OpenAI APIs

API                Endpoint                         Description
Completions        /v1/completions                  Text completion for generative models
Chat Completions   /v1/chat/completions             Chat-based completions with conversation history
Responses          /v1/responses                    OpenAI Responses API
Embeddings         /v1/embeddings                   Generate embeddings for text
Transcriptions     /v1/audio/transcriptions         Audio-to-text transcription (ASR)
Translations       /v1/audio/translations           Audio translation
Realtime           /v1/realtime                     WebSocket-based streaming transcription

vLLM Custom APIs

API                Endpoint                         Description
Tokenizer          /tokenize, /detokenize           Encode and decode tokens
Pooling            /pooling                         Extract hidden states from pooling models
Classification     /classify                        Text classification
Score              /score                           Sentence similarity scoring
Re-rank            /rerank, /v1/rerank, /v2/rerank  Document re-ranking

Completions API

The Completions API generates text based on a prompt:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    max_tokens=50,
    temperature=0.7,
)

print(completion.choices[0].text)

Streaming completions

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].text, end="", flush=True)
The suffix parameter is not supported in vLLM’s Completions API.

Chat Completions API

The Chat API supports conversational interactions with chat-tuned models:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"},
]

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=messages,
)

print(completion.choices[0].message.content)

Streaming chat completions

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=messages,
    stream=True,
)

for chunk in completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
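When streaming, the full reply arrives as a sequence of content deltas, so a common pattern is to accumulate them into a single string. A minimal sketch of that pattern, using stand-in chunk objects in place of a live stream (the `fake_chunks` data is illustrative, not a server response):

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the content deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # role-only and final chunks carry no content
            parts.append(delta.content)
    return "".join(parts)

# Stand-in chunks mimicking the shape of streamed responses.
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["It was ", "played in ", "Arlington, Texas.", None]
]
print(collect_stream(fake_chunks))  # It was played in Arlington, Texas.
```

The same `collect_stream` function works on the real iterator returned when `stream=True`.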

Tool calling

vLLM supports function calling and tool usage:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,
)

print(completion.choices[0].message.tool_calls)
Set parallel_tool_calls=false to ensure vLLM returns at most one tool call per request. The default (true) allows the model to emit multiple tool calls in one response, but does not guarantee that it will.
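After the model returns a tool call, the client is responsible for executing it locally and (typically) sending the result back in a follow-up message. A sketch of the dispatch step, using a stand-in for the returned `message.tool_calls[0].function` (the `get_current_weather` implementation and its return value are hypothetical):

```python
import json

def get_current_weather(location, unit="celsius"):
    """Hypothetical local implementation backing the declared tool."""
    return {"location": location, "temperature": 22, "unit": unit}

# Stand-in for completion.choices[0].message.tool_calls[0].function,
# whose arguments field is a JSON-encoded string.
tool_call = {
    "name": "get_current_weather",
    "arguments": '{"location": "Boston, MA", "unit": "celsius"}',
}

registry = {"get_current_weather": get_current_weather}
args = json.loads(tool_call["arguments"])
result = registry[tool_call["name"]](**args)
print(result)  # {'location': 'Boston, MA', 'temperature': 22, 'unit': 'celsius'}
```

Note that `arguments` is a JSON string, not a dict, so it must be decoded before the call.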

Chat templates

Models require a chat template to format messages properly. Most models include this in their tokenizer config. For models without one, specify a custom template:
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --chat-template ./path-to-chat-template.jinja
You can also override the content format detection:
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --chat-template-content-format openai
Supported formats:
  • string - Simple string content: "Hello world"
  • openai - List of dictionaries: [{"type": "text", "text": "Hello world"}]
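The two formats are interchangeable for plain text; the list form simply wraps each piece of content in a typed part. A small helper sketching the conversion from the `string` form to the `openai` form (the helper name is our own, not part of vLLM):

```python
def to_openai_content(content):
    """Normalize message content to the `openai` list-of-parts format."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content  # already a list of parts

print(to_openai_content("Hello world"))
# [{'type': 'text', 'text': 'Hello world'}]
```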

Embeddings API

Generate vector embeddings for text:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input="The food was delicious and the waiter was very friendly.",
)

print(response.data[0].embedding)

Batch embeddings

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input=[
        "The food was delicious.",
        "The service was excellent.",
        "Great atmosphere!",
    ],
)

for item in response.data:
    print(f"Embedding {item.index}: {item.embedding[:5]}...")  # First 5 dimensions
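Embeddings are typically compared by cosine similarity. A self-contained sketch of that computation, using short made-up vectors in place of real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.1, 0.4]
print(round(cosine_similarity(v1, v2), 4))
```

In practice you would pass `response.data[i].embedding` values straight into this function.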

Extra parameters

vLLM supports parameters beyond the OpenAI API specification. Pass them via extra_body in the Python client:
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "top_k": 50,
        "repetition_penalty": 1.1,
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
For direct HTTP requests, merge them into the JSON payload:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "top_k": 50,
    "repetition_penalty": 1.1
  }'
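When building the raw HTTP payload yourself, the same merge is a plain dict update: vLLM-specific keys sit at the top level of the JSON body alongside the standard fields. A minimal sketch:

```python
import json

base = {
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}
extra = {"top_k": 50, "repetition_penalty": 1.1}

# vLLM-specific parameters are merged into the top level of the payload.
payload = {**base, **extra}
print(json.dumps(payload, indent=2))
```

This is what `extra_body` does for you under the hood in the Python client.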

Server configuration

Common options

Bind the server to a specific interface and port:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8080

API security

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --api-key secret-token \
  --enable-request-id-headers
Use the API key in requests:
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="secret-token",
)

Request ID tracking

Enable request ID headers for tracing:
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"x-request-id": "my-request-123"},
)

print(completion._request_id)  # "my-request-123"
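For tracing across many calls, each request needs its own unique ID. One common pattern (our own sketch, not a vLLM API) is to mint a fresh UUID-based header per request:

```python
import uuid

def request_headers():
    """Build per-request headers with a fresh, traceable request ID."""
    return {"x-request-id": f"req-{uuid.uuid4().hex}"}

headers = request_headers()
print(headers["x-request-id"])
```

Pass the result as `extra_headers=request_headers()` so every call is individually traceable in your logs.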

Offline documentation

For air-gapped environments, enable offline API docs:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --enable-offline-docs
Access the docs at http://localhost:8000/docs.

Multi-modal inputs

vLLM supports vision and audio inputs for compatible models:
completion = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"},
                },
            ],
        }
    ],
)
The image_url.detail parameter is not currently supported.

Transcriptions API

Transcribe audio files to text using ASR models:
Requires installing audio dependencies: pip install vllm[audio]
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="en",
        response_format="verbose_json",
    )

print(transcription.text)
Supported audio formats: FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, WEBM
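Before uploading, you can validate a file's extension against the supported formats client-side. A hypothetical helper sketching that check:

```python
# Formats listed as supported by the Transcriptions API.
SUPPORTED_AUDIO = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported_audio(filename):
    """Check a filename's extension against the supported audio formats."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return ext in SUPPORTED_AUDIO

print(is_supported_audio("audio.mp3"))   # True
print(is_supported_audio("audio.aiff"))  # False
```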

With curl

curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
  -H "Authorization: Bearer token-abc123" \
  -F "[email protected]" \
  -F "model=openai/whisper-large-v3-turbo" \
  -F "language=en" \
  -F "response_format=verbose_json"

Performance tuning

Tune batching to trade throughput against latency: --max-num-batched-tokens caps how many tokens are processed per scheduler step, and --max-num-seqs caps how many sequences run concurrently.
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256

Docker deployment

Run vLLM server in a container:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto
For more deployment options, see Docker deployment.

Ray Serve integration

For production-scale deployments with autoscaling and load balancing, use Ray Serve:
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.entrypoints.openai.api_server import build_app

@serve.deployment
class VLLMDeployment:
    def __init__(self, model: str):
        engine_args = AsyncEngineArgs(model=model)
        self.app = build_app(engine_args)
    
    async def __call__(self, request):
        return await self.app(request)

deployment = VLLMDeployment.bind(model="meta-llama/Llama-3.2-1B-Instruct")
serve.run(deployment)
Learn more in the Ray Serve LLM documentation.

API reference

For complete API specifications and all supported parameters, see:
