vLLM is a high-throughput and memory-efficient inference engine for large language models. It provides significant performance improvements over standard PyTorch inference through continuous batching, PagedAttention, and optimized CUDA kernels.

Why vLLM?

High Throughput

2-3x faster than standard inference with continuous batching

Memory Efficient

PagedAttention reduces memory waste by up to 80%

Easy Integration

Compatible with HuggingFace models and OpenAI API format

Multi-GPU Support

Built-in tensor parallelism for distributed inference

Installation

Step 1: Install vLLM

For CUDA 12.1 and PyTorch 2.1:
pip install vllm
For other CUDA versions, see vLLM Installation Guide
Step 2: Verify Installation

python -c "import vllm; print(vllm.__version__)"
Step 3: Use Docker (Recommended)

docker pull qwenllm/qwen:cu121
docker run --gpus all -it --rm qwenllm/qwen:cu121 bash
vLLM requires CUDA 11.4 or higher and a GPU with compute capability 7.0 or higher.

GPU Requirements

Memory Requirements by Model Size

| Model | seq_len 2048 | seq_len 8192 | seq_len 16384 | seq_len 32768 |
|---|---|---|---|---|
| Qwen-1.8B | 6.22 GB | 7.46 GB | - | - |
| Qwen-7B | 17.94 GB | 20.96 GB | - | - |
| Qwen-7B-Int4 | 9.10 GB | 12.26 GB | - | - |
| Qwen-14B | 33.40 GB | - | - | - |
| Qwen-14B-Int4 | 13.30 GB | - | - | - |
| Qwen-72B | 166.87 GB | 185.50 GB | 210.80 GB | 253.80 GB |
| Qwen-72B-Int4 | 55.37 GB | 73.66 GB | 97.79 GB | 158.80 GB |
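
The memory figures above translate directly into a feasibility check before you launch a server. A minimal sketch (the `fits_on_gpus` helper is hypothetical, and it ignores activation spikes and CUDA context overhead):

```python
# Hypothetical helper: check whether a model footprint from the table above
# fits across N GPUs at a given vLLM gpu-memory-utilization fraction.

def fits_on_gpus(required_gb: float, num_gpus: int, gpu_gb: float,
                 gpu_memory_utilization: float = 0.90) -> bool:
    """Rough feasibility check; ignores activation spikes and CUDA overhead."""
    usable_gb = num_gpus * gpu_gb * gpu_memory_utilization
    return required_gb <= usable_gb

# Qwen-72B at seq_len 2048 needs ~166.87 GB (table above):
print(fits_on_gpus(166.87, num_gpus=2, gpu_gb=80))  # 2x A100 80GB -> False
print(fits_on_gpus(166.87, num_gpus=4, gpu_gb=80))  # 4x A100 80GB -> True
```

When the check fails, either add GPUs (higher `--tensor-parallel-size`) or switch to the Int4 variant from the table.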

Supported Consumer GPUs

| GPU Memory | GPU Models | Supported Qwen Models |
|---|---|---|
| 24 GB | RTX 4090 / 3090 / A5000 | Qwen-1.8B, Qwen-7B, Qwen-7B-Int4, Qwen-14B-Int4 |
| 16 GB | RTX A4000 | Qwen-1.8B, Qwen-7B-Int4, Qwen-14B-Int4 |
| 12 GB | RTX 3080 Ti | Qwen-1.8B, Qwen-14B-Int4 |
| 11 GB | RTX 2080 Ti | Qwen-1.8B |
Bfloat16 requires GPU compute capability ≥ 8.0. For older GPUs, use --dtype float16.

Quick Start

Standalone OpenAI API Server

Deploy an OpenAI-compatible API server with vLLM:
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16 \
  --chat-template template_chatml.jinja

Chat Template Configuration

Download and use the ChatML template for proper formatting:
# Download template
wget https://raw.githubusercontent.com/QwenLM/Qwen/main/examples/template_chatml.jinja

# Use with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --chat-template template_chatml.jinja
The chat template file is required for proper message formatting with the Qwen models.

Python Wrapper

Use the vLLM wrapper for a Transformers-like interface:
Step 1: Download the Wrapper

wget https://raw.githubusercontent.com/QwenLM/Qwen/main/examples/vllm_wrapper.py
Step 2: Use in Python

from vllm_wrapper import vLLMWrapper

# Single GPU
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)

# Multi-GPU (4 GPUs)
# model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=4)

# Int4 model
# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4', 
#                     tensor_parallel_size=1, 
#                     dtype="float16")

# Chat interface
response, history = model.chat(query="Hello, who are you?", history=None)
print(response)

response, history = model.chat(
    query="Tell me about quantum computing", 
    history=history
)
print(response)

Wrapper Configuration

from vllm_wrapper import vLLMWrapper

model = vLLMWrapper(
    model_dir='Qwen/Qwen-7B-Chat',
    trust_remote_code=True,
    tensor_parallel_size=1,        # Number of GPUs
    gpu_memory_utilization=0.98,   # GPU memory fraction
    dtype='bfloat16',              # 'bfloat16', 'float16', 'float32'
    max_model_len=8192,            # Maximum sequence length
)

API Usage

Using OpenAI Python Client

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "What is artificial intelligence?"}
    ],
    stream=False,
    stop_token_ids=[151645]  # Required for vLLM
)

print(response.choices[0].message.content)
For the standalone vLLM API, you must set stop_token_ids=[151645] or stop=["<|im_end|>"] to prevent the model from generating indefinitely.
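
Since every request needs these stop settings, it can help to centralize them. A minimal sketch (the `build_qwen_request` helper and its defaults are hypothetical; 151645 is the id of `<|im_end|>` in the Qwen tokenizer):

```python
# Hypothetical helper that bundles the stop settings every request needs.
QWEN_IM_END_TOKEN_ID = 151645  # id of "<|im_end|>" in the Qwen tokenizer

def build_qwen_request(messages, max_tokens=512, stream=False):
    """Return kwargs for openai.ChatCompletion.create with required stop settings."""
    return {
        "model": "Qwen",
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
        # Either of these prevents endless generation with the standalone vLLM API:
        "stop_token_ids": [QWEN_IM_END_TOKEN_ID],
        "stop": ["<|im_end|>"],
    }

req = build_qwen_request([{"role": "user", "content": "Hi"}])
# response = openai.ChatCompletion.create(**req)
```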

Advanced Configuration

Performance Tuning

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --chat-template template_chatml.jinja

Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model` | string | required | Model name or path (HuggingFace format) |
| `--tensor-parallel-size` | int | `1` | Number of GPUs for tensor parallelism |
| `--dtype` | string | `auto` | Data type: `auto`, `bfloat16`, `float16`, `float32` |
| `--max-model-len` | int | - | Maximum sequence length (prompt + generation) |
| `--gpu-memory-utilization` | float | `0.90` | Fraction of GPU memory to use (0.0 to 1.0) |
| `--max-num-seqs` | int | `256` | Maximum number of sequences processed in parallel |
| `--max-num-batched-tokens` | int | - | Maximum tokens processed in a batch |
| `--swap-space` | int | `4` | CPU swap space size in GB |
| `--disable-log-requests` | boolean | - | Disable request logging for reduced overhead |

Multi-GPU Deployment

Tensor Parallelism

Distribute model layers across multiple GPUs:
# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-14B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --dtype bfloat16

# 4 GPUs for Qwen-72B
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --dtype bfloat16

# 8 GPUs for maximum performance
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --max-num-seqs 512
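
Note that vLLM requires `--tensor-parallel-size` to evenly divide the model's attention head count, so not every GPU count is valid. A quick pre-launch check (the head counts below are taken from each model's `config.json` `num_attention_heads` field; verify them against your checkout):

```python
# Sanity check before launching: tensor_parallel_size must evenly divide the
# model's attention head count. Head counts below are assumptions taken from
# the models' config.json (num_attention_heads); verify for your version.
NUM_HEADS = {"Qwen-7B": 32, "Qwen-14B": 40, "Qwen-72B": 64}

def valid_tp_size(model: str, tp: int) -> bool:
    return NUM_HEADS[model] % tp == 0

print(valid_tp_size("Qwen-72B", 4))   # True: 64 heads split evenly across 4 GPUs
print(valid_tp_size("Qwen-14B", 16))  # False: 40 heads don't split across 16 GPUs
```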

GPU Selection

Control which GPUs to use:
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4

# Use GPUs on different nodes (requires Ray)
ray start --head
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray

Production Deployment

Systemd Service

Create /etc/systemd/system/qwen-vllm.service:
[Unit]
Description=Qwen vLLM OpenAI API Server
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin:/usr/local/cuda/bin"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
ExecStart=/opt/qwen/venv/bin/python -m vllm.entrypoints.openai.api_server \
  --model /models/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --chat-template /opt/qwen/template_chatml.jinja
Restart=always
RestartSec=10
StandardOutput=append:/var/log/qwen-vllm/output.log
StandardError=append:/var/log/qwen-vllm/error.log

[Install]
WantedBy=multi-user.target
Manage the service:
sudo systemctl daemon-reload
sudo systemctl enable qwen-vllm
sudo systemctl start qwen-vllm
sudo systemctl status qwen-vllm

Docker Deployment

docker run --gpus all -d \
  --name qwen-vllm \
  --restart always \
  -p 8000:8000 \
  -v /models:/models:ro \
  -v /templates:/templates:ro \
  qwenllm/qwen:cu121 \
  python -m vllm.entrypoints.openai.api_server \
    --model /models/Qwen-7B-Chat \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000 \
    --chat-template /templates/template_chatml.jinja

Load Balancing

Nginx configuration for multiple vLLM instances:
upstream vllm_backend {
    least_conn;
    server 127.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
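
nginx only ejects a backend after `max_fails` failed requests; a deploy script can instead probe each instance's `/health` endpoint up front and only put healthy ones into rotation. A minimal sketch (the `healthy_backends` helper is hypothetical; it assumes you have already collected an HTTP status per backend, e.g. via `urllib.request`):

```python
# Hypothetical rotation helper: keep only backends whose GET /health returned
# 200. Addresses below are illustrative and match the nginx upstream above.
from typing import Dict, List

def healthy_backends(health: Dict[str, int]) -> List[str]:
    """Given {backend: HTTP status from GET /health}, return backends to keep."""
    return sorted(b for b, status in health.items() if status == 200)

statuses = {"127.0.0.1:8000": 200, "127.0.0.1:8001": 503, "127.0.0.1:8002": 200}
print(healthy_backends(statuses))  # ['127.0.0.1:8000', '127.0.0.1:8002']
```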

Performance Benchmarks

Throughput Comparison

Qwen-7B on A100 80GB GPU:
| Method | Throughput (tokens/s) | Latency (ms/token) | Max Batch Size |
|---|---|---|---|
| PyTorch | 40.93 | 24.4 | 1-4 |
| vLLM | 68.5 | 14.6 | 256+ |
| vLLM (4 GPUs) | 245.2 | 4.1 | 1024+ |

Memory Efficiency

Qwen-72B memory usage:
| Configuration | GPU Memory | Supported Batch Size |
|---|---|---|
| PyTorch (2×A100) | 144.69 GB | 1-2 |
| vLLM (2×A100) | 165 GB | 64 |
| vLLM (4×A100) | 166 GB | 256+ |

Limitations

Current vLLM Limitations with Qwen:
  1. Dynamic NTK RoPE: vLLM does not support Qwen's dynamic NTK-aware RoPE scaling, so generation quality may degrade on long sequences.
  2. Context Length: Maximum context length is fixed at model initialization. Cannot dynamically extend beyond max_model_len.
  3. Repetition Penalty: Requires vLLM ≥ 0.2.2 for repetition penalty support.

Troubleshooting

Error: torch.cuda.OutOfMemoryError

Solutions:
  • Reduce --gpu-memory-utilization (try 0.85 or 0.80)
  • Decrease --max-model-len
  • Use quantized Int4 model
  • Increase --tensor-parallel-size
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat-Int4 \
  --dtype float16 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096
Error: ValueError: trust_remote_code is required

Solution: Always include --trust-remote-code:
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code
Issue: Model generates indefinitely

Solution: Set proper stop tokens:
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[...],
    stop_token_ids=[151645]  # Essential!
)
Issue: Not achieving expected performance

Solutions:
  • Increase --max-num-seqs for more concurrent requests
  • Use --dtype bfloat16 instead of float16/float32
  • Disable request logging with --disable-log-requests
  • Check GPU utilization with nvidia-smi
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --max-num-seqs 512 \
  --dtype bfloat16 \
  --disable-log-requests
Issue: Multi-GPU deployment failures

Solutions:
  • Ensure all GPUs are the same model
  • Check NCCL configuration
  • Verify GPU visibility:
nvidia-smi
echo $CUDA_VISIBLE_DEVICES
  • Test with Ray backend:
ray start --head
python -m vllm.entrypoints.openai.api_server \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray

Monitoring

Health Checks

# Check if server is running
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test inference
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 10,
    "stop_token_ids": [151645]
  }'

Metrics Collection

vLLM exposes Prometheus metrics:
curl http://localhost:8000/metrics
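
The endpoint returns Prometheus text exposition format, which is easy to spot-check from a script. A minimal sketch (for real monitoring, scrape it with a Prometheus server; the `vllm:`-prefixed metric names shown are what recent vLLM versions export, so verify them against your build):

```python
# Minimal Prometheus text-format parser for quick spot checks.
def parse_metrics(text: str) -> dict:
    """Map metric names to float values; comments skipped, label sets dropped."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        name = name.split("{", 1)[0]  # drop labels for this rough view
        try:
            values[name] = float(value)
        except ValueError:
            pass
    return values

# Example with metric names recent vLLM versions expose (verify for your build):
sample = "vllm:num_requests_running 3.0\nvllm:gpu_cache_usage_perc 0.42"
print(parse_metrics(sample))
```

In practice you would feed it the body of `GET /metrics` (e.g. fetched with `urllib.request.urlopen`).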

Next Steps

FastChat Integration

Add web UI and more features with FastChat

Production Guide

Production deployment best practices

Performance Tuning

Advanced performance optimization

Monitoring Setup

Set up comprehensive monitoring
