Deployment Options
OpenAI-Compatible API
Deploy a production-ready API server compatible with OpenAI’s format
Docker Deployment
Containerized deployment with pre-built Docker images
vLLM
High-performance inference with vLLM for production workloads
FastChat
Full-featured deployment with web UI and API server
Choosing a Deployment Method
Select the deployment method that best fits your use case:
Quick Testing & Development
For rapid prototyping and local development:
- OpenAI-Compatible API: Simple Python script deployment
- Docker: Pre-configured environment without manual setup
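Once a server is running, any OpenAI-style client can talk to it. The sketch below builds a standard chat-completions request with only the standard library; the endpoint URL and model name are placeholders for whatever your local server exposes, not fixed values.

```python
import json
# urllib is stdlib, so no extra dependency; swap in the `openai` client if preferred.
from urllib import request


def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a request body in the OpenAI chat-completions format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send(base_url: str, body: dict) -> dict:
    """POST the request to an OpenAI-compatible endpoint (server must be running)."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


body = build_chat_request("Qwen-7B-Chat", "Hello!")
# send("http://localhost:8000", body)  # uncomment once the server is up
```

Because the request format is the standard OpenAI one, the same client code works unchanged against the OpenAI-compatible API script, vLLM's server, or FastChat.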
Production Deployments
For production environments requiring high performance:
- vLLM: Best for high-throughput inference with multiple concurrent requests
- FastChat + vLLM: Complete solution with web UI and optimized inference
Scalability Considerations
Single GPU Deployments
All deployment methods support single GPU setups:
- Qwen-1.8B: 4-6GB VRAM
- Qwen-7B: 17-20GB VRAM (RTX 3090/4090)
- Qwen-14B: 30-35GB VRAM (A100 40GB)
- Qwen-72B: 145GB+ VRAM (requires multi-GPU)
Multi-GPU Deployments
For larger models or higher throughput:
- vLLM Tensor Parallelism: Split model across multiple GPUs
- Pipeline Parallelism: Sequential processing across GPUs
- Recommended for Qwen-72B and high-concurrency scenarios
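As an illustrative launch invocation for tensor parallelism (flag names follow vLLM's OpenAI-compatible server; the model path and GPU count here are examples, not a prescription):

```shell
# Split Qwen-72B across 4 GPUs with tensor parallelism (example values)
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen-72B-Chat \
    --trust-remote-code \
    --tensor-parallel-size 4
```

`--tensor-parallel-size` should match the number of GPUs in the group; vLLM shards each weight matrix across them.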
Quantized Models
Reduce memory requirements with minimal quality loss:
- Int8: ~40% memory reduction
- Int4: ~70% memory reduction
- Supported by all deployment methods
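These savings can be sanity-checked with back-of-the-envelope arithmetic on the weights alone. Note the quoted ~40%/~70% end-to-end reductions are smaller than the pure-weight savings below, because activations and the KV cache are not quantized.

```python
# Bytes per parameter for common weight formats (int4 packs two params per byte)
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


fp16 = weight_memory_gb(7e9, "bfloat16")  # 14.0 GB of weights for a 7B model
int8 = weight_memory_gb(7e9, "int8")      # 7.0 GB: half the 16-bit footprint
int4 = weight_memory_gb(7e9, "int4")      # 3.5 GB: a quarter of the 16-bit footprint
```

The gap between the 14 GB weight footprint and the 17-20 GB observed for Qwen-7B is runtime overhead (KV cache, activations, CUDA context), which quantization of the weights does not remove.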
Requirements
System Requirements
Hardware Requirements
| Model Size | Minimum GPU Memory | Recommended GPU | Quantization Options |
|---|---|---|---|
| Qwen-1.8B | 4GB | GTX 1080 Ti | Int8, Int4 |
| Qwen-7B | 16GB | RTX 3090 | Int8, Int4 |
| Qwen-14B | 30GB | A100 40GB | Int8, Int4 |
| Qwen-72B | 145GB | 2x A100 80GB | Int8, Int4 |
Performance Comparison
Benchmark on A100 GPU with Qwen-7B-Chat (generating 2048 tokens):

| Deployment Method | Throughput (tokens/s) | Memory Usage | Setup Complexity |
|---|---|---|---|
| Native PyTorch | 40.93 | 16.99GB | Low |
| OpenAI API | 40.93 | 16.99GB | Low |
| Docker | 40.93 | 16.99GB | Very Low |
| vLLM | 60-80 | 17.5GB | Medium |
| FastChat + vLLM | 60-80 | 17.5GB | Medium |
vLLM provides significant performance improvements through optimized CUDA kernels, continuous batching, and PagedAttention.
Next Steps
Configure for Production
Review Production Best Practices for optimization
Common Issues
Out of Memory Errors
Solutions:
- Use quantized models (Int4/Int8)
- Enable KV cache quantization
- Reduce max_model_len parameter
- Use multi-GPU deployment
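To see why reducing max_model_len helps, note that the KV cache grows linearly with sequence length. The sketch below uses illustrative Qwen-7B-like dimensions (32 layers, hidden size 4096, 16-bit cache); these are assumptions for the arithmetic, not official figures.

```python
def kv_cache_gb(seq_len: int, layers: int = 32, hidden: int = 4096,
                bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache size in GiB: keys + values across every layer."""
    per_token = 2 * layers * hidden * bytes_per_elem  # factor 2 = K and V
    return seq_len * per_token / 2**30


# 0.5 MiB per token => shrinking max_model_len shrinks the cache proportionally
full = kv_cache_gb(8192)   # 4.0 GiB reserved per sequence at an 8K context
short = kv_cache_gb(2048)  # 1.0 GiB after reducing max_model_len to 2048
```

Concurrent requests each need their own cache, so this saving multiplies with batch size.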
Slow Inference Speed
Solutions:
- Install Flash Attention 2
- Use vLLM for production workloads
- Enable tensor parallelism for multi-GPU
- Use bfloat16 instead of float32
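For the Flash Attention suggestion, installation is typically a single pip step (exact build requirements depend on your CUDA toolchain; `--no-build-isolation` is commonly needed so the build can see your installed torch):

```shell
pip install flash-attn --no-build-isolation
```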
Model Loading Failures
Solutions:
- Ensure trust_remote_code=True is set
- Verify checkpoint path is correct
- Check CUDA and PyTorch compatibility
- Update transformers library
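A minimal loading sketch reflecting the checklist above; the checkpoint name is an example, and the actual load is left commented out since it downloads weights and needs sufficient GPU memory.

```python
# Kwargs implementing the fixes above: trust_remote_code is required because the
# Qwen checkpoints ship custom modeling code; device_map="auto" places weights
# across available GPUs automatically.
LOAD_KWARGS = {
    "trust_remote_code": True,
    "device_map": "auto",
}


def load_qwen(checkpoint: str = "Qwen/Qwen-7B-Chat"):
    """Load tokenizer and model from a local path or hub id."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, **LOAD_KWARGS).eval()
    return tokenizer, model


# tokenizer, model = load_qwen()  # run on a machine with enough GPU memory
```

If this fails with an error about executing remote code, trust_remote_code was not passed; if it fails on tensor shapes or missing attributes, update the transformers library before retrying.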