The `vllm chat` command provides an interactive chat interface that connects to a running vLLM API server.
## Basic usage

### Prerequisites

Before using `vllm chat`, you need a running vLLM server:

```bash
# In terminal 1
vllm serve meta-llama/Llama-2-7b-chat-hf

# In terminal 2
vllm chat
```
## Examples

### Basic interactive chat

Running `vllm chat` with no arguments starts an interactive session:

```
Using model: meta-llama/Llama-2-7b-chat-hf
Please enter a message for the chat model:
> Hello! How are you?
[Assistant response]
> What is the capital of France?
[Assistant response]
```
### Quick single message

```bash
vllm chat --quick "What is the capital of France?"
```

Sends a single message and exits:

```
Using model: meta-llama/Llama-2-7b-chat-hf
The capital of France is Paris.
```
### Connect to a custom server

```bash
vllm chat --url http://192.168.1.100:8080/v1
```

### Specify the model name

```bash
vllm chat --model-name gpt-3.5-turbo
```
### With a system prompt

```bash
vllm chat --system-prompt "You are a helpful Python programming assistant."
```

The system prompt is added at the beginning of the conversation:

```
Using model: meta-llama/Llama-2-7b-chat-hf
Please enter a message for the chat model:
> How do I read a file in Python?
[Assistant provides a Python-focused response]
```
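In the OpenAI-compatible chat format, a system prompt is simply the first entry in the request's `messages` list, ahead of any user turns. The sketch below illustrates that layout; `build_messages` is a hypothetical helper for illustration, not part of vLLM:

```python
def build_messages(user_message, system_prompt=None, history=None):
    """Assemble an OpenAI-style chat `messages` list.

    The system prompt, when given, always comes first, followed by
    any prior turns and then the new user message.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(history or [])
    messages.append({"role": "user", "content": user_message})
    return messages


msgs = build_messages(
    "How do I read a file in Python?",
    system_prompt="You are a helpful Python programming assistant.",
)
print([m["role"] for m in msgs])  # ['system', 'user']
```

Because the system message leads the list, it shapes every assistant reply in the session.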
### With an API key

```bash
vllm chat --api-key your-secret-key
```
## Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--url` | string | `http://localhost:8000/v1` | URL of the running OpenAI-compatible API server. |
| `--model-name` | string | (first served model) | The model name to use. If not specified, uses the first available model from the server. |
| `--system-prompt` | string | (none) | System prompt to add at the beginning of the conversation. |
| `--quick`, `-q` | string | (none) | Send a single message and exit. |
| `--api-key` | string | (none) | API key for authentication. Can also be set via the `OPENAI_API_KEY` environment variable. |
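The defaults above amount to a small resolution step: an explicit `--api-key` wins over the `OPENAI_API_KEY` variable, and the model name falls back to the first model the server reports. A minimal sketch of that precedence (the helper names are illustrative, not vLLM's internals):

```python
import os


def resolve_api_key(cli_api_key=None, env=os.environ):
    """An explicit --api-key takes precedence over OPENAI_API_KEY."""
    return cli_api_key or env.get("OPENAI_API_KEY")


def resolve_model(cli_model_name=None, served_models=()):
    """An explicit --model-name wins; otherwise use the server's first model."""
    if cli_model_name:
        return cli_model_name
    return served_models[0] if served_models else None


assert resolve_api_key("flag-key", env={"OPENAI_API_KEY": "env-key"}) == "flag-key"
assert resolve_api_key(None, env={"OPENAI_API_KEY": "env-key"}) == "env-key"
assert resolve_model(None, ["meta-llama/Llama-2-7b-chat-hf"]) == "meta-llama/Llama-2-7b-chat-hf"
```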
## Interactive controls

During an interactive chat session:

- **Enter**: Send the message
- **Ctrl+C** or **Ctrl+Z**: Exit the chat
- **Ctrl+D** (EOF): Exit the chat
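In Python terms, Ctrl+D surfaces as `EOFError` and Ctrl+C as `KeyboardInterrupt`, so the session can be pictured as a plain read-eval loop that breaks on either. This is a schematic sketch of the control flow, not vLLM's actual implementation:

```python
def chat_loop(read_input, send):
    """Minimal REPL skeleton: break cleanly on EOF (Ctrl+D) or
    KeyboardInterrupt (Ctrl+C); otherwise send each line."""
    replies = []
    while True:
        try:
            line = read_input("> ")
        except (EOFError, KeyboardInterrupt):
            break
        replies.append(send(line))
    return replies


# Simulate a session that ends with Ctrl+D (EOFError).
inputs = iter(["Hello!", "Bye"])


def fake_input(prompt):
    try:
        return next(inputs)
    except StopIteration:
        raise EOFError


out = chat_loop(fake_input, send=str.upper)
print(out)  # ['HELLO!', 'BYE']
```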
## Advanced usage

### Multi-turn conversation with context

The chat command maintains conversation history:

```
> What is 2+2?
The answer is 4.
> What about that number multiplied by 3?
That would be 12 (4 * 3 = 12).
```
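Context is kept by resending the accumulated message list with every request: each user turn and each assistant reply is appended to the history before the next call, so a follow-up question like the one above can refer to an earlier answer. A minimal sketch of that bookkeeping (the helper is hypothetical):

```python
def run_turn(history, user_message, reply_fn):
    """Append the user turn, get a reply that can see the whole
    history so far, then append the assistant turn."""
    history.append({"role": "user", "content": user_message})
    reply = reply_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history = []
run_turn(history, "What is 2+2?", lambda h: "The answer is 4.")
run_turn(history, "What about that number multiplied by 3?",
         lambda h: "That would be 12 (4 * 3 = 12).")

# Every turn stays in the list, so later requests carry the full context.
print(len(history))                       # 4
print([m["role"] for m in history])       # ['user', 'assistant', 'user', 'assistant']
```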
### Code assistant

```bash
vllm chat \
    --system-prompt "You are an expert software engineer. Provide concise, working code examples." \
    --model-name codellama/CodeLlama-13b-Instruct-hf
```
### Quick lookups

```bash
# Quick fact check
vllm chat -q "What is the tallest mountain in the world?"

# Quick code snippet
vllm chat -q "Write a Python function to reverse a string"

# Quick translation
vllm chat -q "Translate 'Hello, how are you?' to Spanish"
```
## Example workflows

### Research assistant

```bash
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 4
```

Then, in another terminal:

```bash
vllm chat --system-prompt "You are a research assistant. Provide detailed, well-researched answers with sources when possible."
```
### Creative writing

```bash
vllm chat --system-prompt "You are a creative writing assistant. Help the user with story ideas, character development, and writing techniques."
```

```
> I want to write a sci-fi story about time travel. Give me some unique plot ideas.
[Assistant provides creative suggestions]
> I like the second idea. Help me develop the main character.
[Conversation continues with context]
```
### Language learning

```bash
vllm chat --system-prompt "You are a Spanish language tutor. Respond in Spanish and English, correcting any mistakes I make."
```
## Environment variables

- `OPENAI_API_KEY`: API key for authentication. Used if `--api-key` is not provided.
## Comparison with `vllm complete`

Use `vllm chat` for:

- Multi-turn conversations
- Chat models (Llama-2-chat, Mistral-Instruct, etc.)
- Interactive assistance

Use `vllm complete` for:

- Single-shot text completion
- Base models
- Programmatic text generation
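The two commands also target different OpenAI-compatible endpoints with different request shapes: chat requests carry a `messages` list of role-tagged turns, while completion requests carry a single raw `prompt` string. Illustrative request bodies (field names follow the OpenAI API schema; the model name is just an example):

```python
# What `vllm chat` sends, to /v1/chat/completions:
chat_request = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
}

# What `vllm complete` sends, to /v1/completions:
completion_request = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "The capital of France is",
}

# Chat requests are structured turns; completion requests are raw text.
assert "messages" in chat_request and "prompt" in completion_request
```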