
Overview

This guide covers best practices, architectural patterns, and operational considerations for running vLLM in production at scale.

Architecture patterns

Single-instance deployment

Simplest deployment for low-to-medium traffic:
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────┐
│  vLLM Pod   │
│  (1x GPU)   │
└─────────────┘
Use when:
  • QPS < 10
  • Single model serving
  • Development/testing environments

Load-balanced deployment

Multiple replicas behind a load balancer:
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────────┐
│ Load Balancer   │
│  (Nginx/K8s)    │
└────────┬────────┘
         │
    ┌────┴────┬────────┬────────┐
    v         v        v        v
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ vLLM  │ │ vLLM  │ │ vLLM  │ │ vLLM  │
│ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod N │
└───────┘ └───────┘ └───────┘ └───────┘
Use when:
  • QPS > 10
  • High availability required
  • Horizontal scaling needed
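
The least-connections policy referenced in the Nginx configuration later in this guide can be sketched in a few lines. This is an illustrative Python sketch of the routing decision, not part of vLLM or Nginx:

```python
def pick_backend(in_flight: dict[str, int]) -> str:
    """Least-connections routing: send the next request to the backend
    with the fewest in-flight requests (what Nginx's least_conn does)."""
    return min(in_flight, key=in_flight.get)

# Example: vllm2 has the fewest active requests, so it gets the next one.
pods = {"vllm0": 4, "vllm1": 3, "vllm2": 1, "vllm3": 2}
```

Least-connections suits LLM serving better than round-robin because request durations vary widely with output length.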

Multi-model deployment

Serve multiple models with routing:
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       v
┌─────────────────┐
│  Model Router   │
└────────┬────────┘
         │
    ┌────┴─────┬──────────┐
    v          v          v
┌────────┐ ┌────────┐ ┌────────┐
│ Model  │ │ Model  │ │ Model  │
│  7B    │ │  13B   │ │  70B   │
└────────┘ └────────┘ └────────┘
Use when:
  • Multiple models needed
  • Different performance tiers
  • Cost optimization
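
At its core, a model router is a lookup from the requested model name to a backend URL. A minimal sketch (backend names and URLs are hypothetical, not from this guide):

```python
# Hypothetical backend map; adjust names and URLs to your deployment.
MODEL_BACKENDS = {
    "llama-3-8b":  "http://vllm-8b:8000",
    "llama-3-70b": "http://vllm-70b:8000",
}

def route(model: str, default: str = "http://vllm-8b:8000") -> str:
    """Return the base URL of the backend serving the requested model."""
    return MODEL_BACKENDS.get(model, default)
```

A production router additionally forwards the request body and streams the response back; an API gateway or a dedicated LLM proxy typically fills that role.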

Load balancing

Nginx configuration

1. Create Nginx configuration

upstream vllm_backend {
    least_conn;  # Use least connections algorithm
    server vllm0:8000 max_fails=3 fail_timeout=30s;
    server vllm1:8000 max_fails=3 fail_timeout=30s;
    server vllm2:8000 max_fails=3 fail_timeout=30s;
    server vllm3:8000 max_fails=3 fail_timeout=30s;
    
    keepalive 32;  # Connection pooling
}

server {
    listen 80;
    
    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # Stream tokens as they are generated (don't buffer SSE responses)
        proxy_buffering off;

        # Timeouts for long-running requests
        proxy_connect_timeout 300s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }
    
    location /health {
        proxy_pass http://vllm_backend/health;
        proxy_http_version 1.1;
    }
}
2. Deploy with Docker Compose

version: '3.8'

services:
  nginx:
    image: nginx:latest
    ports:
      - "8000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro  # fragment above lacks the top-level http{} block
    depends_on:
      - vllm0
      - vllm1
      - vllm2
      - vllm3
    networks:
      - vllm-network

  vllm0:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm1:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm2:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm3:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=3
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

networks:
  vllm-network:
    driver: bridge

Kubernetes Service with session affinity

Improve prefix-cache hit rates by routing each client's requests to the same pod:
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # 1 hour
Session affinity improves cache hit rates for prefix caching, reducing latency and cost.
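
Affinity only pays off if the cached prefix is byte-identical across requests. A sketch of the client-side discipline (prompt text is illustrative):

```python
# Keep the system prompt constant and put all per-request variation at
# the end, so the shared prefix stays byte-identical and cacheable.
SYSTEM_PROMPT = "You are a concise, helpful support assistant."

def build_messages(user_text: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]
```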

Performance optimization

Model configuration

Recommended vLLM settings for production:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --disable-log-requests \
  --trust-remote-code  # only needed for models that ship custom code
Key parameters:
Parameter               Recommended     Purpose
gpu-memory-utilization  0.85-0.90       Leave headroom for fragmentation
max-model-len           Model-specific  Reduce for higher throughput
max-num-seqs            128-256         Balance latency vs. throughput
enable-prefix-caching   true            Cache common prompts
enable-chunked-prefill  true            Reduce TTFT for long prompts
disable-log-requests    true            Reduce logging overhead

Quantization

Reduce memory usage and increase throughput:
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 4
Quantization comparison:
Method           Memory savings  Quality  Speed
FP16 (baseline)  0%              100%     1.0x
FP8              50%             98-99%   1.5-2.0x
AWQ/GPTQ         75%             95-98%   1.2-1.5x
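
The memory-savings column follows directly from bits per parameter. A back-of-the-envelope estimate for weights only (KV cache and activations come on top):

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate model weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

fp16 = weight_memory_gb(70, 16)  # 140.0 GB for a 70B model
int4 = weight_memory_gb(70, 4)   # 35.0 GB with 4-bit AWQ/GPTQ
savings = 1 - int4 / fp16        # 0.75, matching the table above
```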

Multi-GPU tensor parallelism

For large models, split across multiple GPUs:
# 70B model on 4x A100 GPUs
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192
Tensor parallelism requires high-bandwidth interconnects (NVLink within a node, InfiniBand across nodes). Prefer single-node multi-GPU systems where possible.

Monitoring and observability

Prometheus metrics

vLLM exposes Prometheus metrics at /metrics:
apiVersion: v1
kind: Service
metadata:
  name: vllm-metrics
  labels:
    app: vllm
spec:
  ports:
  - name: metrics
    port: 8000
    targetPort: 8000
  selector:
    app: vllm
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-monitor
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: metrics
    path: /metrics
Key metrics to monitor:
  • vllm:num_requests_running - Active requests
  • vllm:num_requests_waiting - Queued requests
  • vllm:gpu_cache_usage_perc - GPU KV cache usage
  • vllm:avg_generation_throughput_toks_per_s - Throughput
  • vllm:time_to_first_token_seconds - TTFT latency
  • vllm:time_per_output_token_seconds - Generation latency

Grafana dashboard

Example Grafana queries:
# Request rate
rate(vllm:request_success_total[5m])

# Average TTFT
rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m])

# P95 generation latency
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))

# KV cache usage
vllm:gpu_cache_usage_perc

OpenTelemetry tracing

Enable distributed tracing:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --otlp-traces-endpoint http://jaeger:4318/v1/traces

Health checks and probes

Kubernetes probes

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 60  # 10 minutes for large models
Set failureThreshold high enough for large models to load. A 70B model can take 5-10 minutes to initialize.
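
The same startup budget (failureThreshold × periodSeconds) is useful in deploy scripts that wait for a pod to come up. An illustrative polling helper with an injected check function so it stays testable; this is not a vLLM or Kubernetes API:

```python
import time

def wait_until_ready(check, timeout_s: float = 600, interval_s: float = 10,
                     sleep=time.sleep) -> bool:
    """Poll check() every interval_s seconds until it returns True,
    giving up after timeout_s (600s ~ failureThreshold 60 x 10s period)."""
    waited = 0.0
    while waited < timeout_s:
        if check():
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

In practice `check` would be an HTTP GET against the pod's /health endpoint.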

Autoscaling

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_running  # exposed via a custom-metrics adapter (e.g. prometheus-adapter)
      target:
        type: AverageValue
        averageValue: "50"  # Scale when >50 concurrent requests per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

SkyPilot autoscaling

service:
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 5  # Scale when QPS > 5 per replica
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

Security best practices

API authentication

Use API keys for authentication:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key your-secret-key
Client usage:
import openai

client = openai.OpenAI(
    base_url="http://vllm:8000/v1",
    api_key="your-secret-key"
)

Network policies

Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-network-policy
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: api-gateway
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - podSelector:
        matchLabels:
          role: model-storage

Secrets management

Use Kubernetes secrets or cloud secret managers:
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: hf_xxxxxxxxxxxxx
---
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token

Disaster recovery

Model weight caching

Store models in persistent storage:
volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: vllm-models
volumeMounts:
- name: model-cache
  mountPath: /root/.cache/huggingface

Multi-region deployment

Deploy across multiple regions for high availability:
┌──────────────┐     ┌──────────────┐
│  Region 1    │     │  Region 2    │
│  (Primary)   │     │  (Failover)  │
├──────────────┤     ├──────────────┤
│ vLLM Cluster │     │ vLLM Cluster │
│  (3 pods)    │     │  (3 pods)    │
└──────────────┘     └──────────────┘
        │                    │
        └────────┬───────────┘
                 v
         ┌──────────────┐
         │ Global Load  │
         │  Balancer    │
         └──────────────┘

Cost optimization

1. Right-size GPU allocation

Match GPU to model size:
  • 7B models: L4 (24GB) or A10G (24GB); T4 (16GB) fits only with quantization
  • 13B models: A100 40GB for FP16; L4/A10G (24GB) with quantization
  • 70B models: 4x A100 40GB or 2x A100 80GB for FP16; 1x A100 80GB with 4-bit quantization
2. Use quantization

Reduce GPU requirements with AWQ/GPTQ/FP8 quantization.
3. Enable autoscaling

Scale replicas down during off-peak hours (scale-to-zero needs tooling beyond the HPA, e.g. KEDA or SkyPilot).
4. Batch requests

Use continuous batching to maximize throughput.
5. Enable prefix caching

Cache common system prompts to reduce compute.

Troubleshooting

High latency

Symptoms: slow response times
Solutions:
  1. Check GPU utilization with nvidia-smi
  2. Reduce max-model-len to free memory
  3. Enable chunked prefill
  4. Add more replicas
  5. Enable quantization

OOM errors

Symptoms: CUDA out of memory
Solutions:
  1. Reduce gpu-memory-utilization to 0.85
  2. Reduce max-num-seqs
  3. Reduce max-model-len
  4. Enable quantization
  5. Use tensor parallelism

Request timeouts

Symptoms: 504 Gateway Timeout
Solutions:
  1. Increase proxy timeouts in Nginx/K8s
  2. Increase readinessProbe timeout
  3. Check for deadlocked requests with metrics
  4. Review max-num-batched-tokens

Checklist

Before going to production:
  • Load testing completed (target QPS)
  • Monitoring and alerting configured
  • Health checks validated
  • Autoscaling tested
  • Disaster recovery plan documented
  • Security review completed
  • Cost analysis performed
  • SLO/SLA defined
  • Rollback procedure documented
  • On-call rotation established
