Serve LLMs with vLLM’s OpenAI-compatible HTTP API for chat, completions, embeddings, and more.
vLLM provides an HTTP server that implements OpenAI’s API specifications, allowing you to serve models and interact with them using standard OpenAI clients and tools.
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. To disable this behavior, pass `--generation-config vllm` when launching the server.
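For reference, a server exposing the endpoints below can be launched like this; the model name and API key mirror the client examples that follow, and both flags are optional (adjust to your deployment):

```shell
# Launch the OpenAI-compatible server on port 8000 (the default).
# --api-key enables simple bearer-token auth for all endpoints;
# --generation-config vllm ignores the repo's generation_config.json,
# as described above.
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --api-key token-abc123 \
    --generation-config vllm
```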
The Completions API generates text based on a prompt:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    max_tokens=50,
    temperature=0.7,
)
print(completion.choices[0].text)
```
Streaming is supported by passing `stream=True`, which yields chunks as they are generated:

```python
completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    stream=True,
)
for chunk in completion:
    print(chunk.choices[0].text, end="", flush=True)
```
The `suffix` parameter is not supported in vLLM's Completions API.
The Chat API supports conversational interactions with chat-tuned models:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"},
]
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=messages,
)
print(completion.choices[0].message.content)
```
Tool (function) calling follows the OpenAI `tools` format:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,
)
print(completion.choices[0].message.tool_calls)
```
Set `parallel_tool_calls=False` to ensure vLLM returns at most one tool call per request. The default (`True`) allows multiple tool calls but does not guarantee them.
Models require a chat template to format messages properly. Most models include this in their tokenizer config. For models without one, specify a custom template:
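One way to do this is at server startup; a minimal sketch, assuming a local Jinja template file (`./template.jinja` is a hypothetical path):

```shell
# Supply a Jinja chat template for a model whose tokenizer
# config does not include one.
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --chat-template ./template.jinja
```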
The Embeddings API generates vector representations of input text:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input="The food was delicious and the waiter was very friendly.",
)
print(response.data[0].embedding)
```
Multiple inputs can be embedded in a single request:

```python
response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input=[
        "The food was delicious.",
        "The service was excellent.",
        "Great atmosphere!",
    ],
)
for item in response.data:
    print(f"Embedding {item.index}: {item.embedding[:5]}...")  # First 5 dimensions
```
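A common next step with these vectors is similarity scoring. A minimal sketch in plain Python (no server required), assuming two embedding vectors have already been retrieved from the response:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With real responses, pass item.embedding values here;
# these short vectors are placeholders for illustration.
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.3]
print(cosine_similarity(v1, v2))  # ~1.0 for identical vectors
```

Scores near 1.0 indicate semantically similar texts; orthogonal vectors score near 0.0.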