Qwen models can be deployed in various ways depending on your requirements for performance, scalability, and ease of use. This guide covers all available deployment options.

Deployment Options

  • OpenAI-Compatible API: Deploy a production-ready API server compatible with OpenAI’s format
  • Docker Deployment: Containerized deployment with pre-built Docker images
  • vLLM: High-performance inference for production workloads
  • FastChat: Full-featured deployment with web UI and API server

Choosing a Deployment Method

Select the deployment method that best fits your use case:

Quick Testing & Development

For rapid prototyping and local development:
  • OpenAI-Compatible API: Simple Python script deployment
  • Docker: Pre-configured environment without manual setup
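For a quick smoke test against an OpenAI-compatible server, a request can be sent with nothing but the standard library. A minimal sketch, assuming the server is listening on localhost:8000 and serving Qwen/Qwen-7B-Chat (adjust both to your deployment):

```python
import json
import urllib.request

# Assumed endpoint of a locally running OpenAI-compatible server;
# adjust host, port, and model name to match your deployment.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "Qwen/Qwen-7B-Chat") -> urllib.request.Request:
    """Build a chat-completion request in OpenAI's wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Sending the request (requires the server to be running):
# with urllib.request.urlopen(build_request("Hello!")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```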

Production Deployments

For production environments requiring high performance:
  • vLLM: Best for high-throughput inference with multiple concurrent requests
  • FastChat + vLLM: Complete solution with web UI and optimized inference
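As a sketch of how the FastChat + vLLM pieces fit together, the three processes are typically started in separate terminals (module names follow FastChat's CLI; the model path is an example):

```shell
# Terminal 1: controller coordinates workers
python3 -m fastchat.serve.controller

# Terminal 2: vLLM-backed worker serving Qwen (model path is an example)
python3 -m fastchat.serve.vllm_worker \
    --model-path Qwen/Qwen-7B-Chat --trust-remote-code

# Terminal 3: OpenAI-compatible API server in front of the workers
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```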

Scalability Considerations

All deployment methods support single-GPU setups. Approximate VRAM requirements:
  • Qwen-1.8B: 4-6GB VRAM
  • Qwen-7B: 17-20GB VRAM (RTX 3090/4090)
  • Qwen-14B: 30-35GB VRAM (A100 40GB)
  • Qwen-72B: 145GB+ VRAM (requires multi-GPU)
For larger models or higher throughput:
  • vLLM Tensor Parallelism: Split model across multiple GPUs
  • Pipeline Parallelism: Sequential processing across GPUs
  • Recommended for Qwen-72B and high-concurrency scenarios
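A tensor-parallel launch can be sketched as a single vLLM server command; --tensor-parallel-size must match the number of GPUs used (the model path and GPU count below are examples):

```shell
# Serve Qwen-72B-Chat across 4 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen-72B-Chat \
    --tensor-parallel-size 4 \
    --trust-remote-code
```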
Quantization reduces memory requirements with minimal quality loss:
  • Int8: ~40% memory reduction
  • Int4: ~70% memory reduction
  • Supported by all deployment methods
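A back-of-envelope calculation shows where those figures come from. The sketch below counts weight memory only; the KV cache and activations stay at higher precision, which is why the overall savings quoted above (~40% and ~70%) are smaller than the raw weight math (50% and 75%):

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Qwen-7B weights at different precisions (weights only; KV cache and
# activation overhead keep real-world savings below the raw ratios):
fp16 = weight_memory_gib(7e9, 16)   # ≈ 13.0 GiB
int8 = weight_memory_gib(7e9, 8)    # ≈ 6.5 GiB
int4 = weight_memory_gib(7e9, 4)    # ≈ 3.3 GiB
```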

Requirements

System Requirements

Python >= 3.8
PyTorch >= 1.12 (2.0+ recommended)
Transformers >= 4.32.0
CUDA >= 11.4 (for GPU deployments)
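As a quick sanity check before deploying, the installed versions can be printed with the standard library (this only reports versions; it does not enforce the minimums above):

```python
import sys
from importlib import metadata

def check_versions():
    """Print the Python version and the installed versions of the
    key packages (package names assumed installed via pip)."""
    print("python", sys.version.split()[0])
    for pkg in ("torch", "transformers"):
        try:
            print(pkg, metadata.version(pkg))
        except metadata.PackageNotFoundError:
            print(pkg, "not installed")

check_versions()
```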

Hardware Requirements

Model Size | Minimum GPU Memory | Recommended GPU | Quantization Options
Qwen-1.8B  | 4GB                | GTX 1080 Ti     | Int8, Int4
Qwen-7B    | 16GB               | RTX 3090        | Int8, Int4
Qwen-14B   | 30GB               | A100 40GB       | Int8, Int4
Qwen-72B   | 145GB              | 2x A100 80GB    | Int8, Int4

Performance Comparison

Benchmark on A100 GPU with Qwen-7B-Chat (generating 2048 tokens):
Deployment Method | Throughput (tokens/s) | Memory Usage | Setup Complexity
Native PyTorch    | 40.93                 | 16.99GB      | Low
OpenAI API        | 40.93                 | 16.99GB      | Low
Docker            | 40.93                 | 16.99GB      | Very Low
vLLM              | 60-80                 | 17.5GB       | Medium
FastChat + vLLM   | 60-80                 | 17.5GB       | Medium
vLLM provides significant performance improvements through optimized CUDA kernels, continuous batching, and PagedAttention.

Security Considerations

When deploying in production, always implement proper security measures:
  • Use authentication (API keys, basic auth, OAuth)
  • Enable HTTPS/TLS encryption
  • Implement rate limiting
  • Monitor and log API usage
  • Keep dependencies updated
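As an illustration of the rate-limiting point, a framework-agnostic token bucket takes only a few lines (the rate and burst numbers below are examples, not recommendations):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: allow up to `rate` requests
    per second, with bursts of up to `capacity` requests."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: 5 requests/second with a burst of 10
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]  # first 10 pass, next 2 are throttled
```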

Next Steps

1. Choose Your Deployment Method: Review the options above and select based on your requirements
2. Follow the Setup Guide: Navigate to the specific deployment guide for detailed instructions
3. Configure for Production: Review Production Best Practices for optimization
4. Monitor and Scale: Set up monitoring and scale based on your traffic patterns

Common Issues

Out-of-memory (OOM) errors
Solutions:
  • Use quantized models (Int4/Int8)
  • Enable KV cache quantization
  • Reduce the max_model_len parameter
  • Use multi-GPU deployment
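Combining several of these mitigations in one vLLM launch might look like the following (the Int4 checkpoint name, quantization scheme, and context length are examples; match them to your checkpoint):

```shell
# Serve a GPTQ Int4 checkpoint with a shorter context window
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen-7B-Chat-Int4 \
    --quantization gptq \
    --max-model-len 4096 \
    --trust-remote-code
```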
Slow inference speed
Solutions:
  • Install Flash Attention 2
  • Use vLLM for production workloads
  • Enable tensor parallelism for multi-GPU
  • Use bfloat16 instead of float32
Model loading errors
Solutions:
  • Ensure trust_remote_code=True is set
  • Verify the checkpoint path is correct
  • Check CUDA and PyTorch compatibility
  • Update the transformers library
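Putting the loading-related fixes together, a transformers call might look like this sketch (the checkpoint name is an example; imports are kept inside the function so the snippet reads without the libraries installed):

```python
def load_qwen(checkpoint: str = "Qwen/Qwen-7B-Chat"):
    """Load a Qwen checkpoint with the settings the troubleshooting
    steps above call for. Actually loading requires a GPU and downloads
    the weights on first use."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,      # Qwen ships custom modeling code
        torch_dtype=torch.bfloat16,  # halves memory vs float32
        device_map="auto",
    )
    return tokenizer, model

# tokenizer, model = load_qwen()  # downloads weights on first call
```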
