Overview
This guide covers best practices, architectural patterns, and operational considerations for running vLLM in production at scale.
Architecture patterns
Single-instance deployment
Simplest deployment for low-to-medium traffic:
┌─────────────┐
│ Client │
└──────┬──────┘
│
v
┌─────────────┐
│ vLLM Pod │
│ (1x GPU) │
└─────────────┘
Use when:
- QPS < 10
- Single model serving
- Development/testing environments
Load-balanced deployment
Multiple replicas behind a load balancer:
┌─────────────┐
│ Client │
└──────┬──────┘
│
v
┌─────────────────┐
│ Load Balancer │
│ (Nginx/K8s) │
└────────┬────────┘
│
┌────┴────┬────────┬────────┐
v v v v
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ vLLM │ │ vLLM │ │ vLLM │ │ vLLM │
│ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod N │
└───────┘ └───────┘ └───────┘ └───────┘
Use when:
- QPS > 10
- High availability required
- Horizontal scaling needed
Multi-model deployment
Serve multiple models with routing:
┌─────────────┐
│ Client │
└──────┬──────┘
│
v
┌─────────────────┐
│ Model Router │
└────────┬────────┘
│
┌────┴─────┬──────────┐
v v v
┌────────┐ ┌────────┐ ┌────────┐
│ Model │ │ Model │ │ Model │
│ 7B │ │ 13B │ │ 70B │
└────────┘ └────────┘ └────────┘
Use when:
- Multiple models needed
- Different performance tiers
- Cost optimization
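The model router above can be sketched as a simple lookup from requested model name to the backend pool that serves it; the backend names and URLs here are illustrative placeholders, not part of vLLM:

```python
# Hypothetical model -> backend-pool mapping for the router tier.
MODEL_BACKENDS = {
    "meta-llama/Meta-Llama-3-8B-Instruct": "http://vllm-8b:8000",
    "meta-llama/Meta-Llama-3-70B-Instruct": "http://vllm-70b:8000",
}

def route(model: str) -> str:
    """Return the base URL of the backend pool serving `model`."""
    try:
        return MODEL_BACKENDS[model]
    except KeyError:
        raise ValueError(f"no backend registered for model {model!r}")
```

In practice the router also handles retries and fallbacks (for example, degrading to a smaller tier when the large pool is saturated).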
Load balancing
Nginx configuration
Create an Nginx configuration (nginx.conf):
upstream vllm_backend {
    least_conn;  # Use least connections algorithm
    server vllm0:8000 max_fails=3 fail_timeout=30s;
    server vllm1:8000 max_fails=3 fail_timeout=30s;
    server vllm2:8000 max_fails=3 fail_timeout=30s;
    server vllm3:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;  # Connection pooling to upstreams
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts for long-running generation requests
        proxy_connect_timeout 300s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }

    location /health {
        proxy_pass http://vllm_backend/health;
        proxy_http_version 1.1;
    }
}
Deploy with Docker Compose
version: '3.8'

# Shared settings for all vLLM replicas (YAML anchor)
x-vllm-common: &vllm-common
  image: vllm/vllm-openai:latest
  runtime: nvidia
  volumes:
    - ~/.cache/huggingface:/root/.cache/huggingface
  shm_size: 10gb
  ipc: host
  command: --model meta-llama/Meta-Llama-3-8B-Instruct
  networks:
    - vllm-network

services:
  nginx:
    image: nginx:latest
    ports:
      - "8000:80"
    volumes:
      # Mount into conf.d so the stock nginx.conf (events/http blocks) stays intact
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      - vllm0
      - vllm1
      - vllm2
      - vllm3
    networks:
      - vllm-network

  vllm0:
    <<: *vllm-common
    environment:
      - NVIDIA_VISIBLE_DEVICES=0

  vllm1:
    <<: *vllm-common
    environment:
      - NVIDIA_VISIBLE_DEVICES=1

  vllm2:
    <<: *vllm-common
    environment:
      - NVIDIA_VISIBLE_DEVICES=2

  vllm3:
    <<: *vllm-common
    environment:
      - NVIDIA_VISIBLE_DEVICES=3

networks:
  vllm-network:
    driver: bridge
Kubernetes Service with session affinity
Improve prefix-cache hit rates by routing each client's requests to the same pod:
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 80
      targetPort: 8000
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # 1 hour
Session affinity improves cache hit rates for prefix caching, reducing latency and cost.
Model configuration
Recommended baseline settings for production:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --disable-log-requests \
    --trust-remote-code  # only for models that ship custom code; omit otherwise
Key parameters:
| Parameter | Recommended | Purpose |
|---|---|---|
| gpu-memory-utilization | 0.85-0.90 | Leave headroom for fragmentation |
| max-model-len | Model-specific | Reduce for higher throughput |
| max-num-seqs | 128-256 | Balance latency vs throughput |
| enable-prefix-caching | true | Cache common prompts |
| enable-chunked-prefill | true | Reduce TTFT for long prompts |
| disable-log-requests | true | Reduce logging overhead |
Quantization
Reduce memory usage and increase throughput:
vllm serve TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --tensor-parallel-size 4
Quantization comparison:
| Method | Memory Savings | Quality | Speed |
|---|---|---|---|
| FP16 (baseline) | 0% | 100% | 1.0x |
| FP8 | 50% | 98-99% | 1.5-2.0x |
| AWQ/GPTQ | 75% | 95-98% | 1.2-1.5x |
Multi-GPU tensor parallelism
For large models, split across multiple GPUs:
# 70B model on 4x A100 GPUs
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192
Tensor parallelism requires high-bandwidth interconnects: NVLink within a node, InfiniBand between nodes. Keep all tensor-parallel ranks on a single multi-GPU node whenever possible.
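As a back-of-the-envelope check on whether a model fits, weight memory per GPU can be estimated from the parameter count (FP16, ignoring KV cache and activations; the helper below is an illustrative sketch, not part of vLLM):

```python
def weight_memory_gib(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate model-weight memory in GiB (FP16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# 70B weights in FP16, split across tensor-parallel-size 4:
per_gpu_gib = weight_memory_gib(70) / 4  # about 33 GiB per GPU, before KV cache
```

The remainder of each GPU's memory (up to gpu-memory-utilization) goes to the KV cache, which is what actually limits batch size.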
Monitoring and observability
Prometheus metrics
vLLM exposes Prometheus metrics at /metrics:
apiVersion: v1
kind: Service
metadata:
  name: vllm-metrics
  labels:
    app: vllm
spec:
  ports:
    - name: metrics
      port: 8000
      targetPort: 8000
  selector:
    app: vllm
---
apiVersion: monitoring.coreos.com/v1  # ServiceMonitor is a Prometheus Operator CRD
kind: ServiceMonitor
metadata:
  name: vllm-monitor
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: metrics
      path: /metrics
Key metrics to monitor:
- vllm:num_requests_running - Requests actively generating
- vllm:num_requests_waiting - Requests queued for scheduling
- vllm:gpu_cache_usage_perc - KV cache utilization (not overall GPU memory)
- vllm:avg_generation_throughput_toks_per_s - Generation throughput
- vllm:time_to_first_token_seconds - TTFT latency
- vllm:time_per_output_token_seconds - Per-token generation latency
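For quick ad-hoc checks outside Grafana, the metrics endpoint can be scraped and parsed with the standard library alone. A minimal sketch (the endpoint URL is an assumption; labeled series collapse to the last sample seen, which is fine for a spot check but not for real aggregation):

```python
import re

def parse_metrics(text: str) -> dict:
    """Parse Prometheus text format into {metric_name: value}.

    Comment lines are skipped; label sets are ignored, so a metric with
    multiple labeled series keeps only the last value encountered.
    """
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        m = re.match(r"([\w:]+)(?:\{[^}]*\})?\s+([-+.eE\d]+)", line)
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

# Usage (URL assumed):
# import urllib.request
# body = urllib.request.urlopen("http://vllm:8000/metrics").read().decode()
# queued = parse_metrics(body).get("vllm:num_requests_waiting", 0.0)
```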
Grafana dashboard
Example Grafana queries:
# Request rate
rate(vllm:request_success_total[5m])
# Average TTFT
rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m])
# P95 generation latency
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))
# GPU utilization
vllm:gpu_cache_usage_perc
OpenTelemetry tracing
Enable distributed tracing:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --otlp-traces-endpoint http://jaeger:4318/v1/traces
Health checks and probes
Kubernetes probes
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 60  # 60 x 10s = up to 10 minutes for large models to load
Set failureThreshold high enough for large models to load. A 70B model can take 5-10 minutes to initialize.
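Outside Kubernetes, the same wait-for-ready logic can be scripted against the /health endpoint. A small poller sketch (URL and timings are placeholders, standard library only):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str = "http://localhost:8000/health",
                     timeout_s: float = 600.0,
                     interval_s: float = 10.0) -> bool:
    """Poll the health endpoint until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server still loading the model
        time.sleep(interval_s)
    return False
```

Useful in CI smoke tests or deploy scripts that must not send traffic before the model finishes loading.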
Autoscaling
Horizontal Pod Autoscaler (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "50"  # Scale when >50 concurrent requests per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
Pods-type custom metrics require a metrics adapter (for example, prometheus-adapter) that exposes the vLLM metric to the HPA API.
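The HPA's core scaling decision is ceil(currentReplicas × currentMetricValue / targetValue); a quick sketch of that arithmetic with the numbers from the manifest above:

```python
import math

def desired_replicas(current: int, metric_avg: float, target: float) -> int:
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current * metric_avg / target)

# 4 pods averaging 80 running requests against a target of 50:
# desired_replicas(4, 80, 50) -> ceil(6.4) -> 7 replicas
```

The behavior policies then cap how fast the controller may move toward that desired count.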
SkyPilot autoscaling
service:
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 5  # Scale when QPS > 5 per replica
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
Security best practices
API authentication
Use API keys for authentication:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --api-key your-secret-key
Client usage:
import openai

client = openai.OpenAI(
    base_url="http://vllm:8000/v1",
    api_key="your-secret-key",
)
Network policies
Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-network-policy
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: api-gateway
      ports:
        - protocol: TCP
          port: 8000
  egress:
    - to:
        - podSelector:
            matchLabels:
              role: model-storage
Secrets management
Use Kubernetes secrets or cloud secret managers:
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: hf_xxxxxxxxxxxxx
---
# Reference the secret from the container spec:
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: token
Disaster recovery
Model checkpointing
Store models in persistent storage:
# Pod spec:
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: vllm-models
# Container spec:
volumeMounts:
  - name: model-cache
    mountPath: /root/.cache/huggingface
Multi-region deployment
Deploy across multiple regions for high availability:
┌──────────────┐ ┌──────────────┐
│ Region 1 │ │ Region 2 │
│ (Primary) │ │ (Failover) │
├──────────────┤ ├──────────────┤
│ vLLM Cluster │ │ vLLM Cluster │
│ (3 pods) │ │ (3 pods) │
└──────────────┘ └──────────────┘
│ │
└────────┬───────────┘
v
┌──────────────┐
│ Global Load │
│ Balancer │
└──────────────┘
Cost optimization
Right-size GPU allocation
Match GPU to model size:
- 7B models: L4 (24GB) or A10G (24GB); T4 (16GB) only with quantization or a short context
- 13B models: A100 40GB for FP16; L4 (24GB) or A10G (24GB) with quantization
- 70B models: 4x A100 40GB or 2x A100 80GB for FP16; a single A100 80GB only with 4-bit quantization (AWQ/GPTQ)
Use quantization
Reduce GPU requirements with AWQ/GPTQ/FP8 quantization.
Enable autoscaling
Scale down during off-peak hours; scaling all the way to zero requires KEDA or a serving layer such as SkyServe, since a standard HPA does not go below its minimum replica count.
Batch requests
Use continuous batching to maximize throughput.
Enable prefix caching
Cache common system prompts to reduce compute.
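Prefix caching only pays off when the shared prefix is byte-identical across requests, so keep the system prompt a single fixed constant and append only the varying user text. A client-side sketch (the prompt content and company name are placeholders):

```python
# One fixed system prompt reused verbatim across all requests, so the
# server's prefix cache can serve the shared prefix tokens.
SYSTEM_PROMPT = (
    "You are a support assistant for ExampleCorp. Answer concisely and cite "
    "the relevant policy section when applicable."
)

def build_messages(user_text: str) -> list:
    """Static system prompt first; only the user suffix varies per request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

# client.chat.completions.create(model=..., messages=build_messages("Hi"))
```

Even small per-request variations in the system prompt (timestamps, request IDs) break byte-identity and defeat the cache.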
Troubleshooting
High latency
Symptoms: Slow response times
Solutions:
- Check GPU utilization with nvidia-smi
- Reduce max-model-len to free memory
- Enable chunked prefill
- Add more replicas
- Enable quantization
OOM errors
Symptoms: CUDA out of memory
Solutions:
- Reduce gpu-memory-utilization to 0.85
- Reduce max-num-seqs
- Reduce max-model-len
- Enable quantization
- Use tensor parallelism
Request timeouts
Symptoms: 504 Gateway Timeout
Solutions:
- Increase proxy timeouts in Nginx/K8s
- Increase readinessProbe timeout
- Check for deadlocked requests with metrics
- Review max-num-batched-tokens
Checklist
Before going to production:
- Load balancer configured with health checks and long proxy timeouts
- Prometheus metrics scraped and Grafana dashboards in place
- Liveness, readiness, and startup probes tuned for model load time
- Autoscaling policies load-tested
- API authentication enabled and network policies applied
- Tokens and keys stored as secrets, not in manifests
- Model weights on persistent storage, with a failover region planned
Next steps