vLLM is a high-throughput and memory-efficient inference engine for large language models. It provides significant performance improvements over standard PyTorch inference through continuous batching, PagedAttention, and optimized CUDA kernels.
Why vLLM?
High Throughput: 2-3x faster than standard inference with continuous batching
Memory Efficient: PagedAttention reduces memory waste by up to 80%
Easy Integration: compatible with HuggingFace models and the OpenAI API format
Multi-GPU Support: built-in tensor parallelism for distributed inference
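The memory-waste claim can be illustrated with a toy calculation (a hedged sketch with made-up block size and sequence lengths, not vLLM internals): contiguous preallocation reserves the full maximum length for every sequence, while paged allocation only rounds each sequence up to the next block.

```python
# Toy illustration of why paged KV-cache allocation wastes far less memory
# than contiguous preallocation. All numbers are illustrative only.

MAX_LEN = 2048    # slots reserved per sequence under contiguous allocation
BLOCK_SIZE = 16   # slots per page under paged allocation

def contiguous_slots(seq_lens):
    # Every sequence reserves the full maximum length up front.
    return MAX_LEN * len(seq_lens)

def paged_slots(seq_lens):
    # Each sequence only occupies the whole blocks it actually touches.
    return sum(-(-n // BLOCK_SIZE) * BLOCK_SIZE for n in seq_lens)

seqs = [37, 180, 512, 90, 1300]   # actual generated lengths
used = sum(seqs)
waste_contig = 1 - used / contiguous_slots(seqs)
waste_paged = 1 - used / paged_slots(seqs)
print(f"contiguous waste: {waste_contig:.0%}, paged waste: {waste_paged:.0%}")
# → contiguous waste: 79%, paged waste: 2%
```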
Installation
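Install from PyPI (a minimal sketch; the examples in this guide were written against specific vLLM releases, so pin a version if you hit incompatibilities):

```shell
pip install vllm
# Optionally pin a known-good release, e.g.:
# pip install "vllm>=0.2.2"
```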
Verify Installation
python -c "import vllm; print(vllm.__version__)"
Using Docker (Recommended)
docker pull qwenllm/qwen:cu121
docker run --gpus all -it --rm qwenllm/qwen:cu121 bash
vLLM requires CUDA 11.4 or higher and a GPU with compute capability 7.0 or higher.
GPU Requirements
Memory Requirements by Model Size
| Model | seq_len 2048 | seq_len 8192 | seq_len 16384 | seq_len 32768 |
| --- | --- | --- | --- | --- |
| Qwen-1.8B | 6.22GB | 7.46GB | - | - |
| Qwen-7B | 17.94GB | 20.96GB | - | - |
| Qwen-7B-Int4 | 9.10GB | 12.26GB | - | - |
| Qwen-14B | 33.40GB | - | - | - |
| Qwen-14B-Int4 | 13.30GB | - | - | - |
| Qwen-72B | 166.87GB | 185.50GB | 210.80GB | 253.80GB |
| Qwen-72B-Int4 | 55.37GB | 73.66GB | 97.79GB | 158.80GB |
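The growth with sequence length in the table is driven largely by the KV cache. A rough estimator (the model dimensions below are assumed 7B-class values for illustration, not exact Qwen-7B internals; real totals also include weights, activations, and CUDA overhead):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, dtype_bytes=2):
    """Approximate KV-cache size: K and V tensors (factor of 2) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Illustrative 7B-class dimensions (assumed): 32 layers, 32 KV heads, head_dim 128
gib = kv_cache_bytes(32, 32, 128, seq_len=8192) / 1024**3
print(f"~{gib:.1f} GiB of KV cache for one 8192-token sequence")
# → ~4.0 GiB of KV cache for one 8192-token sequence
```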
Supported Consumer GPUs
| GPU Memory | GPU Models | Supported Qwen Models |
| --- | --- | --- |
| 24GB | RTX 4090/3090/A5000 | Qwen-1.8B, Qwen-7B, Qwen-7B-Int4, Qwen-14B-Int4 |
| 16GB | RTX A4000 | Qwen-1.8B, Qwen-7B-Int4, Qwen-14B-Int4 |
| 12GB | RTX 3080Ti | Qwen-1.8B, Qwen-14B-Int4 |
| 11GB | RTX 2080Ti | Qwen-1.8B |
Bfloat16 requires GPU compute capability ≥ 8.0. For older GPUs, use --dtype float16.
Quick Start
Standalone OpenAI API Server
Deploy an OpenAI-compatible API server with vLLM:
Single GPU
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-7B-Chat \
--trust-remote-code \
--dtype bfloat16 \
--chat-template template_chatml.jinja
Chat Template Configuration
Download and use the ChatML template for proper formatting:
# Download template
wget https://raw.githubusercontent.com/QwenLM/Qwen/main/examples/template_chatml.jinja
# Use with vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-7B-Chat \
--trust-remote-code \
--chat-template template_chatml.jinja
The chat template file is required for proper message formatting with the Qwen models.
Python Wrapper
Use the vLLM wrapper for a Transformers-like interface:
Download the Wrapper
wget https://raw.githubusercontent.com/QwenLM/Qwen/main/examples/vllm_wrapper.py
Use in Python
from vllm_wrapper import vLLMWrapper
# Single GPU
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
# Multi-GPU (4 GPUs)
# model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=4)
# Int4 model
# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4',
#                     tensor_parallel_size=1,
#                     dtype="float16")
# Chat interface
response, history = model.chat(query="Hello, who are you?", history=None)
print(response)
response, history = model.chat(
    query="Tell me about quantum computing",
    history=history,
)
print(response)
Wrapper Configuration
from vllm_wrapper import vLLMWrapper
model = vLLMWrapper(
    model_dir='Qwen/Qwen-7B-Chat',
    trust_remote_code=True,
    tensor_parallel_size=1,       # Number of GPUs
    gpu_memory_utilization=0.98,  # GPU memory fraction
    dtype='bfloat16',             # 'bfloat16', 'float16', 'float32'
    max_model_len=8192,           # Maximum sequence length
)
API Usage
Using OpenAI Python Client
Basic Chat
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "What is artificial intelligence?"}
    ],
    stream=False,
    stop_token_ids=[151645]  # Required for vLLM
)
print(response.choices[0].message.content)
With the standalone vLLM API server, you must set stop_token_ids=[151645] or stop=["<|im_end|>"] to prevent infinite generation.
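To make that requirement hard to forget, the request payload can be built centrally (a sketch; build_chat_payload is a hypothetical helper, not part of vLLM or the OpenAI client):

```python
import json

STOP_TOKEN_IDS = [151645]  # <|im_end|> for Qwen, per the note above

def build_chat_payload(messages, model="Qwen", **overrides):
    # Always attach the stop token so vLLM does not generate indefinitely.
    payload = {"model": model, "messages": messages,
               "stop_token_ids": STOP_TOKEN_IDS}
    payload.update(overrides)
    return payload

payload = build_chat_payload([{"role": "user", "content": "Hi"}], max_tokens=64)
# POST this body to http://localhost:8000/v1/chat/completions
print(json.dumps(payload))
```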
Advanced Configuration
Maximum Throughput
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-7B-Chat \
--trust-remote-code \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 256 \
--chat-template template_chatml.jinja
Configuration Parameters
--model: Model name or path (HuggingFace format)
--tensor-parallel-size: Number of GPUs for tensor parallelism
--dtype: Data type: auto, bfloat16, float16, float32
--max-model-len: Maximum sequence length (prompt + generation)
--gpu-memory-utilization: Fraction of GPU memory to use (0.0 to 1.0)
--max-num-seqs: Maximum number of sequences processed in parallel
--max-num-batched-tokens: Maximum tokens processed in a batch
--swap-space: CPU swap space size in GB
--disable-log-requests: Disable request logging for reduced overhead
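When launching the server from a supervisor script, these flags can be assembled programmatically (a sketch; the flag names follow the vLLM CLI, but build_server_args itself is a hypothetical helper):

```python
def build_server_args(model, **flags):
    """Turn keyword options into vLLM CLI flags (underscores become dashes)."""
    args = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model]
    for key, value in flags.items():
        flag = "--" + key.replace("_", "-")
        if value is True:               # boolean switches take no value
            args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args

cmd = build_server_args(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.95,
)
print(" ".join(cmd))  # pass `cmd` to subprocess.Popen to launch the server
```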
Multi-GPU Deployment
Tensor Parallelism
Shard each layer's weights across multiple GPUs:
# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-14B-Chat \
--trust-remote-code \
--tensor-parallel-size 2 \
--dtype bfloat16
# 4 GPUs for Qwen-72B
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 4 \
--dtype bfloat16
# 8 GPUs for maximum performance
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 8 \
--dtype bfloat16 \
--max-num-seqs 512
GPU Selection
Control which GPUs to use:
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 4
# Use GPUs on different nodes (requires Ray)
ray start --head
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 8 \
--distributed-executor-backend ray
Production Deployment
Systemd Service
Create /etc/systemd/system/qwen-vllm.service:
[Unit]
Description=Qwen vLLM OpenAI API Server
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin:/usr/local/cuda/bin"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
ExecStart=/opt/qwen/venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model /models/Qwen-72B-Chat \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype bfloat16 \
    --chat-template /opt/qwen/template_chatml.jinja
Restart=always
RestartSec=10
StandardOutput=append:/var/log/qwen-vllm/output.log
StandardError=append:/var/log/qwen-vllm/error.log

[Install]
WantedBy=multi-user.target
Manage the service:
sudo systemctl daemon-reload
sudo systemctl enable qwen-vllm
sudo systemctl start qwen-vllm
sudo systemctl status qwen-vllm
Docker Deployment
docker run --gpus all -d \
--name qwen-vllm \
--restart always \
-p 8000:8000 \
-v /models:/models:ro \
-v /templates:/templates:ro \
qwenllm/qwen:cu121 \
python -m vllm.entrypoints.openai.api_server \
--model /models/Qwen-7B-Chat \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000 \
--chat-template /templates/template_chatml.jinja
Load Balancing
Nginx configuration for multiple vLLM instances:
upstream vllm_backend {
    least_conn;
    server 127.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
Throughput Comparison
Qwen-7B on A100 80GB GPU:
| Method | Throughput (tokens/s) | Latency (ms/token) | Max Batch Size |
| --- | --- | --- | --- |
| PyTorch | 40.93 | 24.4 | 1-4 |
| vLLM | 68.5 | 14.6 | 256+ |
| vLLM (4 GPUs) | 245.2 | 4.1 | 1024+ |
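The throughput and latency columns are internally consistent: per-token latency is roughly the inverse of per-token throughput. A quick sanity check using the figures above:

```python
# (tokens/s, reported ms/token) pairs from the benchmark table
rows = {
    "PyTorch": (40.93, 24.4),
    "vLLM": (68.5, 14.6),
    "vLLM (4 GPUs)": (245.2, 4.1),
}
for name, (tok_per_s, ms_per_tok) in rows.items():
    implied = 1000 / tok_per_s  # ms per token implied by the throughput column
    print(f"{name}: reported {ms_per_tok} ms/token, implied {implied:.1f}")
```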
Memory Efficiency
Qwen-72B memory usage:
| Configuration | GPU Memory | Supported Batch Size |
| --- | --- | --- |
| PyTorch (2xA100) | 144.69GB | 1-2 |
| vLLM (2xA100) | 165GB | 64 |
| vLLM (4xA100) | 166GB | 256+ |
Limitations
Current vLLM Limitations with Qwen:
Dynamic NTK RoPE: vLLM does not support dynamic NTK RoPE scaling, so generation quality may degrade on long sequences.
Context Length: the maximum context length is fixed at model initialization and cannot be extended beyond max_model_len at runtime.
Repetition Penalty: repetition penalty support requires vLLM ≥ 0.2.2.
Troubleshooting
Error: torch.cuda.OutOfMemoryError
Solutions:
Reduce --gpu-memory-utilization (try 0.85 or 0.80)
Decrease --max-model-len
Use quantized Int4 model
Increase --tensor-parallel-size
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-7B-Chat-Int4 \
--dtype float16 \
--gpu-memory-utilization 0.85 \
--max-model-len 4096
Error: ValueError: trust_remote_code is required
Solution: Always include --trust-remote-code:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-7B-Chat \
--trust-remote-code
Issue: Model generates indefinitely
Solution: Set proper stop tokens:
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[...],
    stop_token_ids=[151645]  # Essential!
)
Issue: Not achieving expected performance
Solutions:
Increase --max-num-seqs for more concurrent requests
Use --dtype bfloat16 instead of float16/float32
Disable request logging with --disable-log-requests
Check GPU utilization with nvidia-smi
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-7B-Chat \
--trust-remote-code \
--max-num-seqs 512 \
--dtype bfloat16 \
--disable-log-requests
Issue: Problems with multi-GPU deployment
Solutions:
Ensure all GPUs are the same model
Check NCCL configuration
Verify GPU visibility:
nvidia-smi
echo $CUDA_VISIBLE_DEVICES
ray start --head
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 4 \
--distributed-executor-backend ray
Monitoring
Health Checks
# Check if server is running
curl http://localhost:8000/health
# List available models
curl http://localhost:8000/v1/models
# Test inference
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen",
"messages": [{"role": "user", "content": "test"}],
"max_tokens": 10,
"stop_token_ids": [151645]
}'
Metrics Collection
vLLM exposes Prometheus metrics:
curl http://localhost:8000/metrics
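The /metrics output is plain Prometheus text, so it can be scraped or parsed directly (a sketch; vllm_num_requests_running is an example metric name and may differ across vLLM versions, and labels are ignored here for simplicity):

```python
def parse_prometheus(text):
    """Parse Prometheus text format into {metric_name: value}, ignoring labels."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]      # strip any {label="..."} suffix
        metrics[name] = float(value)
    return metrics

sample = """# HELP vllm_num_requests_running Number of running requests
vllm_num_requests_running 3.0
vllm_gpu_cache_usage_perc 0.42
"""
print(parse_prometheus(sample))
```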
Next Steps
FastChat Integration Add web UI and more features with FastChat
Production Guide Production deployment best practices
Performance Tuning Advanced performance optimization
Monitoring Setup Set up comprehensive monitoring