The `vllm complete` command provides an interactive text completion interface that connects to a running vLLM API server.
## Basic usage

### Prerequisites

You need a running vLLM server:

```shell
# In terminal 1
vllm serve facebook/opt-125m

# In terminal 2
vllm complete
```
## Examples

### Basic interactive completion

Running `vllm complete` starts an interactive session:

```console
Using model: facebook/opt-125m
Please enter prompt to complete:
> Once upon a time
there was a young girl who lived in a small village...
> The weather today is
sunny and warm with a light breeze...
```
### Quick single completion

```shell
vllm complete --quick "The capital of France is"
```

Generates a single completion and exits:

```console
Using model: facebook/opt-125m
Paris, which is located in the northern part of the country.
```
### Connect to a custom server

```shell
vllm complete --url http://192.168.1.100:8080/v1
```

### Specify the model name

```shell
vllm complete --model-name gpt-3.5-turbo
```

### Control output length

```shell
vllm complete --max-tokens 200
```

### With an API key

```shell
vllm complete --api-key your-secret-key
```
## Options

- `--url` (string, default: `http://localhost:8000/v1`): URL of the running OpenAI-compatible API server.
- `--model-name` (string): The model name to use. If not specified, uses the first available model from the server.
- `--max-tokens` (integer): Maximum number of tokens to generate per completion.
- `--quick` (string): Send a single prompt and exit. Alias: `-q`.
- `--api-key` (string): API key for authentication. Can also be set via the `OPENAI_API_KEY` environment variable.
## Interactive controls

During an interactive session:

- Enter: Submit the prompt for completion
- Ctrl+C or Ctrl+Z: Exit
- Ctrl+D (EOF): Exit
## Use cases

### Code completion

```shell
vllm serve codellama/CodeLlama-7b-hf
```

Then:

```console
> def fibonacci(n):
    """Calculate the nth Fibonacci number."""
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
```
### Story generation

```shell
vllm complete --max-tokens 500
```

```console
> In a world where magic was real,
the young wizard apprentice discovered an ancient spellbook hidden in the
library's forbidden section. As he opened the dusty tome, glowing runes
appeared on the pages, revealing secrets that had been lost for centuries...
```
### Text continuation

```shell
vllm complete -q "The three laws of robotics are:" --max-tokens 150
```
### Creative writing prompts

```console
> Write a haiku about programming:
Code flows like water,
Bugs emerge from the shadows,
Debugger saves all.
```
## Advanced usage

### Batch completions

Use a script to process multiple prompts:

```shell
#!/bin/bash
while IFS= read -r prompt; do
  vllm complete -q "$prompt" --max-tokens 100
done < prompts.txt
```
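The same batch loop can be sketched in Python, invoking the CLI once per prompt via `subprocess`. This is a minimal sketch: the `prompts.txt` filename and the 100-token limit mirror the shell script above, and `build_command` is a hypothetical helper, not part of vLLM:

```python
import subprocess
import sys
from pathlib import Path


def build_command(prompt: str, max_tokens: int = 100) -> list[str]:
    """Assemble the argument list for one quick completion (hypothetical helper)."""
    return ["vllm", "complete", "-q", prompt, "--max-tokens", str(max_tokens)]


def run_batch(prompts_file: str = "prompts.txt") -> None:
    """Run one quick completion per non-empty line in the prompts file."""
    for prompt in Path(prompts_file).read_text().splitlines():
        if prompt.strip():
            subprocess.run(build_command(prompt), check=True)


if __name__ == "__main__":
    run_batch(sys.argv[1] if len(sys.argv) > 1 else "prompts.txt")
```

Passing the prompt as a list element (rather than interpolating it into a shell string) avoids the quoting pitfalls the `"$prompt"` expansion guards against in the shell version.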
### With custom parameters via API

For more control, use the REST API directly:

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.8,
    "top_p": 0.95
  }'
```
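A programmatic equivalent of the curl call can be sketched with just the Python standard library. The payload fields mirror the request above; the default URL assumes a local `vllm serve`, and a server must actually be listening for the final call to succeed:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default vllm serve address


def build_payload(prompt: str, model: str = "facebook/opt-125m",
                  max_tokens: int = 100, temperature: float = 0.8,
                  top_p: float = 0.95) -> dict:
    """Build the JSON body for the /v1/completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens,
            "temperature": temperature, "top_p": top_p}


def complete(prompt: str, **kwargs) -> str:
    """POST the payload and return the first completion's text."""
    request = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(build_payload(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("Once upon a time"))
```

The same request shape works with any OpenAI-compatible client library by pointing its base URL at the server.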
## Comparison with `vllm chat`

Use `vllm complete` for:

- Raw text completion
- Base (non-chat) models
- Single-turn generation
- Code completion
- Creative writing

Use `vllm chat` for:

- Conversational interactions
- Chat-tuned models
- Multi-turn dialogues
- Question answering
## Example: Documentation generation

```shell
vllm serve codellama/CodeLlama-13b-hf
```

Then:

```console
> def process_data(df, columns):
    """Process DataFrame columns.

    Args:
        df: Input DataFrame
        columns: List of column names to process

    Returns:
        Processed DataFrame with transformed columns
    """
```
## Environment variables

- `OPENAI_API_KEY`: API key for authentication. Used if `--api-key` is not provided.