This guide covers best practices, optimization strategies, and production considerations for deploying Qwen models at scale.

Architecture Design

Deployment Architecture

Inference Engine

vLLM for production workloads
  • High throughput
  • Memory efficient
  • Multi-GPU support

Orchestration

FastChat for management
  • Model routing
  • Load balancing
  • Web UI optional

Reverse Proxy

Nginx or Traefik
  • SSL termination
  • Rate limiting
  • Request routing

Monitoring

Prometheus + Grafana
  • Metrics collection
  • Alerting
  • Visualization

Performance Optimization

Model Selection

Select based on latency and throughput requirements:
| Model | Use Case | Latency | Quality |
|---|---|---|---|
| Qwen-1.8B | High-throughput, simple tasks | ~50ms | Good |
| Qwen-7B | Balanced performance | ~100ms | Excellent |
| Qwen-14B | Complex reasoning | ~150ms | Superior |
| Qwen-72B | Mission-critical, highest quality | ~400ms | Best |
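As an illustration of the table above, a small (hypothetical) helper can pick the largest model whose typical latency still fits a latency budget; the figures are the approximate values quoted in the table, not benchmarks:

```python
# Approximate per-request latencies from the table above (ms),
# ordered smallest model to largest.
MODELS = [
    ("Qwen-1.8B", 50),
    ("Qwen-7B", 100),
    ("Qwen-14B", 150),
    ("Qwen-72B", 400),
]

def pick_model(latency_budget_ms: float) -> str:
    """Return the largest model whose typical latency fits the budget."""
    best = MODELS[0][0]  # fall back to the smallest model
    for name, latency_ms in MODELS:
        if latency_ms <= latency_budget_ms:
            best = name
    return best
```

For example, a 120 ms budget selects Qwen-7B, while a relaxed 500 ms budget allows Qwen-72B.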

Quantization

Use quantization to reduce memory and improve throughput:
# Int4 quantization (recommended)
model = "Qwen/Qwen-7B-Chat-Int4"
# 70% memory reduction, minimal quality loss

# Int8 quantization
model = "Qwen/Qwen-7B-Chat-Int8"
# 40% memory reduction, negligible quality loss
Quality Comparison (MMLU scores):
  • BF16: 55.8
  • Int8: 55.4 (-0.4)
  • Int4: 55.1 (-0.7)
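The reduction figures above translate into a quick back-of-the-envelope estimate of weight memory. This sketch assumes the quoted reductions are relative to FP16/BF16 weights (2 bytes per parameter):

```python
# Rough weight-memory estimate for Qwen-7B under the reduction
# figures quoted above (assumed relative to FP16/BF16 weights).
PARAMS_BILLIONS = 7.0
BYTES_PER_PARAM_FP16 = 2.0
REDUCTION = {"bf16": 0.0, "int8": 0.40, "int4": 0.70}

def weight_memory_gb(quant: str) -> float:
    base_gb = PARAMS_BILLIONS * BYTES_PER_PARAM_FP16  # ~14 GB at BF16
    return round(base_gb * (1 - REDUCTION[quant]), 1)
```

So Int4 brings ~14 GB of BF16 weights down to roughly 4 GB, fitting comfortably on a single 24 GB consumer GPU.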

Context Length

Set appropriate max_model_len based on use case:
# Short conversations (most use cases)
--max-model-len 4096

# Long documents
--max-model-len 8192

# Extended context (requires more memory)
--max-model-len 16384
Longer context increases memory usage linearly.
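The linear growth comes from the KV cache, which stores one key and one value tensor per layer per token. A rough sketch, using assumed Qwen-7B-like dimensions (32 layers, 32 heads, head dimension 128, FP16 values):

```python
# KV-cache size per sequence grows linearly with context length.
# Model dimensions here are assumptions for a Qwen-7B-like model.
def kv_cache_gb(context_len: int, layers: int = 32, heads: int = 32,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    # K and V tensors (factor of 2) per layer, per token
    per_token_bytes = 2 * layers * heads * head_dim * bytes_per_val
    return context_len * per_token_bytes / (1024 ** 3)
```

Under these assumptions, doubling `--max-model-len` from 4096 to 8192 doubles per-sequence KV-cache memory from ~2 GB to ~4 GB.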

vLLM Configuration

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat-Int4 \
  --trust-remote-code \
  --dtype float16 \
  --max-model-len 4096 \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95 \
  --disable-log-requests \
  --tensor-parallel-size 1

Multi-GPU Strategies

Split single model across GPUs:
# Best for large models (72B)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --dtype bfloat16
Pros: Higher throughput per model
Cons: All GPUs serve a single model
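A rough way to size tensor parallelism (a sketch, not vLLM's own logic) is to find the smallest power-of-two split where each GPU's share of the weights fits, leaving headroom for KV cache and activations. The 30% headroom fraction here is an assumption; a 72B model at BF16 needs roughly 144 GB of weights:

```python
# Minimal sizing sketch: smallest power-of-two tensor-parallel size
# such that each GPU's weight shard fits, with headroom reserved for
# KV cache and activations (headroom fraction is an assumption).
def min_tp_size(model_gb: float, gpu_gb: float, headroom: float = 0.3) -> int:
    usable_gb = gpu_gb * (1 - headroom)
    tp = 1
    while model_gb / tp > usable_gb and tp < 8:
        tp *= 2
    return tp
```

For ~144 GB of Qwen-72B BF16 weights on 80 GB GPUs this yields a tensor-parallel size of 4, matching the `--tensor-parallel-size 4` example above.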

Security

Authentication

from fastapi import FastAPI, Depends, Security, HTTPException
from fastapi.security import APIKeyHeader

app = FastAPI()

API_KEYS = {"sk-key1", "sk-key2", "sk-key3"}
api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key not in API_KEYS:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.post("/v1/chat/completions", dependencies=[Depends(verify_api_key)])
async def chat_completion(request: ChatRequest):
    # Your endpoint logic
    pass

SSL/TLS Configuration

Nginx SSL configuration:
server {
    listen 443 ssl http2;
    server_name api.example.com;
    
    # SSL certificates
    ssl_certificate /etc/ssl/certs/api.example.com.crt;
    ssl_certificate_key /etc/ssl/private/api.example.com.key;
    
    # SSL configuration
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;
    
    # Security headers
    add_header Strict-Transport-Security "max-age=31536000" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-Content-Type-Options "nosniff" always;
    
    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Rate Limiting

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

app = FastAPI()

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("100/minute")
async def chat_completion(request: Request):
    # Your endpoint logic
    pass
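slowapi handles this for you in production, but the idea behind a "100/minute" limit is a token bucket: tokens refill at a fixed rate and each request spends one. A minimal, self-contained sketch:

```python
import time

# Token-bucket sketch: tokens refill continuously at `rate_per_min`,
# up to `capacity`; each allowed request consumes one token.
class TokenBucket:
    def __init__(self, rate_per_min: float, capacity: int):
        self.rate = rate_per_min / 60.0   # tokens added per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket with capacity 5 admits a burst of 5 requests, then rejects further requests until tokens refill.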

Monitoring & Observability

Prometheus Metrics

Expose metrics for monitoring:
from prometheus_client import Counter, Histogram, Gauge, make_asgi_app
import time

# Define metrics
request_count = Counter(
    'qwen_requests_total',
    'Total requests',
    ['model', 'status']
)

request_duration = Histogram(
    'qwen_request_duration_seconds',
    'Request duration',
    ['model']
)

active_requests = Gauge(
    'qwen_active_requests',
    'Active requests',
    ['model']
)

tokens_generated = Counter(
    'qwen_tokens_generated_total',
    'Total tokens generated',
    ['model']
)

# Add metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
    start_time = time.time()
    active_requests.labels(model=request.model).inc()
    
    try:
        response = await generate_response(request)
        request_count.labels(model=request.model, status="success").inc()
        tokens_generated.labels(model=request.model).inc(response.usage.completion_tokens)
        return response
    except Exception as e:
        request_count.labels(model=request.model, status="error").inc()
        raise
    finally:
        duration = time.time() - start_time
        request_duration.labels(model=request.model).observe(duration)
        active_requests.labels(model=request.model).dec()

Grafana Dashboard

Key metrics to monitor:

Throughput

  • Requests per second
  • Tokens per second
  • Batch size utilization

Latency

  • p50, p95, p99 response times
  • Time to first token (TTFT)
  • Inter-token latency

Resources

  • GPU utilization
  • GPU memory usage
  • CPU and system memory

Errors

  • Error rate
  • Timeout rate
  • Queue depth
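The p50/p95/p99 latencies above are percentiles of the observed response-time distribution. A sketch using the nearest-rank method (Prometheus histograms compute this server-side; this is just the underlying idea):

```python
import math

# Nearest-rank percentile: the value at rank ceil(p% of n)
# in the sorted sample.
def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

On a sample of 100 latencies, p99 is the 99th-largest value, so a single slow outlier dominates p99 long before it moves p50.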

Health Checks

Implement comprehensive health checks:
from datetime import datetime

from fastapi import status
from fastapi.responses import JSONResponse
import psutil
import torch

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.get("/health/detailed")
async def detailed_health_check():
    # Check GPU
    gpu_available = torch.cuda.is_available()
    if gpu_available:
        gpu_memory = torch.cuda.get_device_properties(0).total_memory
        gpu_memory_used = torch.cuda.memory_allocated(0)
        gpu_utilization = (gpu_memory_used / gpu_memory) * 100
    else:
        gpu_utilization = None
    
    # Check system resources
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    
    # Check model
    model_loaded = model is not None
    
    health_status = {
        "status": "healthy" if model_loaded and gpu_available else "degraded",
        "model_loaded": model_loaded,
        "gpu": {
            "available": gpu_available,
            "utilization": gpu_utilization,
        },
        "system": {
            "cpu_percent": cpu_percent,
            "memory_percent": memory.percent,
            "memory_available_gb": memory.available / (1024**3)
        },
        "timestamp": datetime.utcnow().isoformat()
    }
    
    if health_status["status"] == "degraded":
        return JSONResponse(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            content=health_status
        )
    
    return health_status

Scaling Strategies

Horizontal Scaling

1. Load Balancer Setup

Configure Nginx for multiple backends:
upstream qwen_backend {
    least_conn;  # Use least connections algorithm
    server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://qwen_backend;
        proxy_next_upstream error timeout http_502 http_503;
    }
}
2. Session Affinity

For stateful applications:
upstream qwen_backend {
    ip_hash;  # Sticky sessions based on client IP
    server 10.0.1.10:8000;
    server 10.0.1.11:8000;
}
3. Auto-scaling

Use Kubernetes HPA or cloud auto-scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
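The HPA manifest above follows Kubernetes' standard scaling rule, desiredReplicas = ceil(currentReplicas × currentUtilization / target), clamped to the configured replica bounds:

```python
import math

# Kubernetes HPA scaling rule, clamped to the min/max replicas
# from the manifest above (2 and 10, target utilization 70%).
def desired_replicas(current: int, current_util: float, target: float = 70,
                     min_r: int = 2, max_r: int = 10) -> int:
    desired = math.ceil(current * current_util / target)
    return max(min_r, min(max_r, desired))
```

For example, 4 replicas at 105% average CPU scale to 6 replicas; sustained low utilization never drops below the 2-replica floor.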

Vertical Scaling

Upgrade to larger GPUs or more GPUs per node:
| Current | Upgrade Path | Performance Gain |
|---|---|---|
| 1x RTX 3090 | 1x A100 40GB | 1.5-2x throughput |
| 1x A100 40GB | 1x A100 80GB | Larger models/batches |
| 1x A100 | 2x A100 (TP) | 1.7-1.9x throughput |
| 2x A100 (TP) | 4x A100 (TP) | Support Qwen-72B |

Disaster Recovery

Backup Strategy

# Backup to S3
aws s3 sync /models/Qwen-7B-Chat \
  s3://your-bucket/models/Qwen-7B-Chat/ \
  --exclude "*.git/*"

# Backup to local NAS
rsync -avz --progress /models/ backup-server:/backups/models/

Disaster Recovery Plan

1. Documentation

Maintain runbooks with:
  • System architecture diagrams
  • Deployment procedures
  • Rollback procedures
  • Contact information
2. Testing

Regularly test:
  • Failover procedures
  • Backup restoration
  • Load balancer health checks
  • Monitoring alerts
3. Automation

Automate recovery:
#!/bin/bash
# Example recovery script
set -e

echo "Starting disaster recovery..."

# Stop failed services
systemctl stop qwen-*

# Restore from backup
aws s3 sync s3://backup/models /models/

# Restart services
systemctl start qwen-controller
sleep 10
systemctl start qwen-worker
sleep 30
systemctl start qwen-api

# Verify health
curl -f http://localhost:8000/health || exit 1

echo "Recovery complete"

Cost Optimization

GPU Utilization

Higher batch sizes improve GPU utilization:
# Before: batch_size=1, 40% GPU util
--max-num-seqs 1

# After: batch_size=128, 85% GPU util
--max-num-seqs 128
Monitor with: nvidia-smi dmon -s u

Spot Instances

For non-critical workloads:
# AWS Spot Instances: 70% cost savings
# Azure Spot VMs: 60-90% cost savings
# GCP Preemptible VMs: 80% cost savings
Implement graceful shutdown:
import signal
import sys

def handle_termination(signum, frame):
    print("Received termination signal, graceful shutdown...")
    # Finish current requests
    # Save state if needed
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_termination)

GPU Sizing

Match GPU to model size:
| Model | Minimum GPU | Recommended GPU | Cost Efficiency |
|---|---|---|---|
| Qwen-7B-Int4 | RTX 3090 | A10 | High |
| Qwen-7B | RTX 3090 | A100 40GB | Medium |
| Qwen-14B-Int4 | A100 40GB | A100 40GB | High |
| Qwen-72B-Int4 | 2x A100 40GB | 2x A100 80GB | Medium |

Checklist

Use this checklist before going to production:

Pre-deployment

  • Model selection and quantization decided
  • GPU resources allocated and tested
  • Load testing completed
  • Security hardening applied
  • SSL/TLS certificates configured
  • Authentication mechanism implemented
  • Rate limiting configured
  • Monitoring and alerting set up
  • Backup strategy implemented
  • Documentation updated

Deployment

  • Services deployed with systemd/Docker
  • Health checks passing
  • Load balancer configured
  • Firewall rules applied
  • Logs being collected
  • Metrics being recorded
  • Alerts being received

Post-deployment

  • Performance benchmarks validated
  • Error rates within SLA
  • Resource utilization acceptable
  • Cost within budget
  • Team trained on operations
  • Runbooks tested
  • On-call rotation established

Troubleshooting

Performance Issues

  1. Check GPU utilization: nvidia-smi
  2. Review batch size: increase --max-num-seqs
  3. Check network latency between services
  4. Review logs for bottlenecks
  5. Consider tensor parallelism

Additional Resources

  • vLLM Documentation: Official vLLM documentation
  • FastChat GitHub: FastChat source and examples
  • Kubernetes Guide: Deploy on Kubernetes
  • Performance Tuning: Advanced optimization guide
