The vllm chat command provides an interactive chat interface that connects to a running vLLM API server.

Basic usage

vllm chat [OPTIONS]

Prerequisites

Before using vllm chat, you need a running vLLM server:
# In terminal 1
vllm serve meta-llama/Llama-2-7b-chat-hf

# In terminal 2
vllm chat

Examples

Basic interactive chat

vllm chat
Starts an interactive session:
Using model: meta-llama/Llama-2-7b-chat-hf
Please enter a message for the chat model:
> Hello! How are you?
[Assistant response]
> What is the capital of France?
[Assistant response]

Quick single message

vllm chat --quick "What is the capital of France?"
Sends a single message and exits:
Using model: meta-llama/Llama-2-7b-chat-hf
The capital of France is Paris.
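Conceptually, --quick corresponds to a single request against the server's OpenAI-compatible chat-completions endpoint, after which the client exits. The sketch below is illustrative, not vLLM source; the function name is hypothetical and only the request body shape follows the OpenAI chat format.

```python
# Hypothetical helper: build the JSON body a single --quick message maps to
# on an OpenAI-compatible /v1/chat/completions route.
def build_quick_request(message, model):
    return {
        "model": model,
        "messages": [{"role": "user", "content": message}],
    }

payload = build_quick_request(
    "What is the capital of France?",
    "meta-llama/Llama-2-7b-chat-hf",
)
```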

Connect to custom server

vllm chat --url http://192.168.1.100:8080/v1

Specify model name

vllm chat --model-name gpt-3.5-turbo

With system prompt

vllm chat --system-prompt "You are a helpful Python programming assistant."
The system prompt is added to the beginning of the conversation:
Using model: meta-llama/Llama-2-7b-chat-hf
Please enter a message for the chat model:
> How do I read a file in Python?
[Assistant provides Python-focused response]
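In OpenAI-compatible chat APIs, "added to the beginning of the conversation" means the prompt is sent as the first message with role "system". A minimal sketch of that message list (the function name is hypothetical, not vLLM's implementation):

```python
# Hypothetical helper: start a conversation, prepending the system prompt
# as the first message when one is given.
def start_conversation(system_prompt=None):
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    return messages

msgs = start_conversation("You are a helpful Python programming assistant.")
msgs.append({"role": "user", "content": "How do I read a file in Python?"})
```

Because the system message precedes every user turn, it steers all subsequent responses in the session.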

With API key

vllm chat --api-key your-secret-key

Options

--url
string
default:"http://localhost:8000/v1"
URL of the running OpenAI-compatible API server.
--model-name
string
The model name to use. If not specified, uses the first available model from the server.
--system-prompt
string
System prompt to add at the beginning of the conversation.
--quick
string
Send a single message and exit. Alias: -q.
--api-key
string
API key for authentication. Can also use OPENAI_API_KEY environment variable.
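When --model-name is omitted, the client falls back to the first model the server advertises. A sketch of that fallback, assuming a listing shaped like the OpenAI /v1/models response (the function name and error message are illustrative, not vLLM's implementation):

```python
# Hypothetical helper: use the requested model name if given, otherwise
# fall back to the first model in a /v1/models-style listing.
def pick_model(models_response, requested=None):
    if requested:
        return requested
    models = models_response.get("data", [])
    if not models:
        raise ValueError("server reported no models")
    return models[0]["id"]

listing = {"data": [{"id": "meta-llama/Llama-2-7b-chat-hf"}]}
```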

Interactive controls

During an interactive chat session:
  • Enter: Send message
  • Ctrl+C or Ctrl+Z: Exit chat
  • Ctrl+D (EOF): Exit chat

Advanced usage

Multi-turn conversation with context

The chat command maintains conversation history:
> What is 2+2?
The answer is 4.
> What about that number multiplied by 3?
That would be 12 (4 * 3 = 12).
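The exchange above works because each request carries the full message history, so the model can resolve "that number" from the earlier turn. A minimal sketch of that bookkeeping (the helper is hypothetical, not vLLM's implementation):

```python
# Hypothetical sketch of the history a chat client accumulates between turns.
def record_turn(history, user_msg, assistant_msg):
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": assistant_msg})

history = []
record_turn(history, "What is 2+2?", "The answer is 4.")
record_turn(history, "What about that number multiplied by 3?",
            "That would be 12 (4 * 3 = 12).")
# The next request would send all four messages, preserving context.
```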

Code assistant

vllm chat \
  --system-prompt "You are an expert software engineer. Provide concise, working code examples." \
  --model-name codellama/CodeLlama-13b-Instruct-hf

Quick lookups

# Quick fact check
vllm chat -q "What is the tallest mountain in the world?"

# Quick code snippet
vllm chat -q "Write a Python function to reverse a string"

# Quick translation
vllm chat -q "Translate 'Hello, how are you?' to Spanish"

Example workflows

Research assistant

vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 4
Then in another terminal:
vllm chat --system-prompt "You are a research assistant. Provide detailed, well-researched answers with sources when possible."

Creative writing

vllm chat --system-prompt "You are a creative writing assistant. Help the user with story ideas, character development, and writing techniques."
> I want to write a sci-fi story about time travel. Give me some unique plot ideas.
[Assistant provides creative suggestions]
> I like the second idea. Help me develop the main character.
[Continues conversation]

Language learning

vllm chat --system-prompt "You are a Spanish language tutor. Respond in Spanish and English, correcting any mistakes I make."

Environment variables

OPENAI_API_KEY
string
API key for authentication. Used if --api-key is not provided.
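The precedence described above (an explicit --api-key wins, OPENAI_API_KEY is the fallback) can be sketched as follows; the function name is hypothetical, not vLLM's implementation:

```python
import os

# Hypothetical helper: resolve the API key, preferring the CLI flag over
# the OPENAI_API_KEY environment variable.
def resolve_api_key(cli_api_key=None, env=None):
    env = os.environ if env is None else env
    if cli_api_key is not None:
        return cli_api_key
    return env.get("OPENAI_API_KEY")
```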

Comparison with vllm complete

Use vllm chat for:
  • Multi-turn conversations
  • Chat models (Llama-2-chat, Mistral-Instruct, etc.)
  • Interactive assistance
Use vllm complete for:
  • Single-shot text completion
  • Base models
  • Programmatic text generation
