Docker provides a containerized environment for running Qwen models with all dependencies pre-configured. This is the easiest way to get started with production deployments.
Pre-built Docker Images
Qwen provides official Docker images on Docker Hub:
# CUDA 11.7 (default)
docker pull qwenllm/qwen:cu117
# CUDA 11.4
docker pull qwenllm/qwen:cu114
# CUDA 12.1 (latest)
docker pull qwenllm/qwen:cu121
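The tag should not exceed the CUDA version your host driver supports. A small helper sketch for picking a tag, assuming only the three image tags listed above exist (the selection logic itself is hypothetical, not part of the official tooling):

```python
# Map of CUDA versions to the official image tags listed above.
CUDA_TAGS = {
    "11.4": "qwenllm/qwen:cu114",
    "11.7": "qwenllm/qwen:cu117",
    "12.1": "qwenllm/qwen:cu121",
}

def pick_image(host_cuda: str) -> str:
    """Return the newest image tag whose CUDA version does not exceed the host's."""
    # Compare (major, minor) tuples so "12.1" sorts after "11.7".
    host = tuple(int(x) for x in host_cuda.split(".")[:2])
    candidates = [
        (tuple(int(x) for x in version.split(".")), tag)
        for version, tag in CUDA_TAGS.items()
        if tuple(int(x) for x in version.split(".")) <= host
    ]
    if not candidates:
        raise ValueError(f"no image tag supports CUDA {host_cuda}")
    return max(candidates)[1]
```

For example, a host driver reporting CUDA 11.8 would get `qwenllm/qwen:cu117`, the newest tag it can run.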
Quick Start
Web Demo Deployment
Download the Deployment Script
git clone https://github.com/QwenLM/Qwen.git
cd Qwen/docker
Run the Web Demo
bash docker_web_demo.sh \
-c /path/to/Qwen-7B-Chat \
-n qwen-web \
--port 8901
Access the Interface
Open your browser and navigate to http://localhost:8901
OpenAI API Server Deployment
Run the API Server
bash docker_openai_api.sh \
-c /path/to/Qwen-7B-Chat \
-n qwen-api \
--port 8000
Test the API
curl http://localhost:8000/v1/models
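Beyond listing models, you can exercise the chat endpoint directly. A minimal stdlib-only client sketch, assuming the server exposes the standard OpenAI-compatible `/v1/chat/completions` route; the base URL and model name are placeholders for your deployment:

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(base_url: str, model: str, user_message: str) -> str:
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Usage: `chat("http://localhost:8000", "Qwen-7B-Chat", "Hello!")` against the container started above.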
CLI Demo Deployment
bash docker_cli_demo.sh \
-c /path/to/Qwen-7B-Chat \
-n qwen-cli
Manual Docker Commands
Basic Container Launch
docker run --gpus all -it --rm \
--name qwen-chat \
-v /path/to/models:/data/shared/Qwen/models \
-p 8000:8000 \
qwenllm/qwen:cu117 \
python openai_api.py -c /data/shared/Qwen/models/Qwen-7B-Chat --server-port 8000 --server-name 0.0.0.0
Persistent Container
For long-running deployments:
docker run --gpus all -d \
--name qwen-api \
--restart always \
-v /path/to/models:/models:ro \
-p 8000:80 \
qwenllm/qwen:cu117 \
python openai_api.py -c /models/Qwen-7B-Chat --server-port 80 --server-name 0.0.0.0
--gpus all: enable GPU access for the container
-d: run the container in detached mode (in the background)
--restart always: automatically restart the container on failure or system reboot
-v: mount a host directory into the container; append :ro for read-only access
-p: map a host port to a container port (host:container)
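Loading a 7B model can take minutes, so a deployment script should not assume the API is up the moment `docker run` returns. A minimal readiness-poll sketch in Python (stdlib only); the probe URL is an assumption, with `/v1/models` being a convenient endpoint on the OpenAI-compatible server:

```python
import time
import urllib.error
import urllib.request

def wait_for_api(url: str, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll `url` until it answers HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; retry after a short sleep
        time.sleep(interval)
    return False
```

Usage: `wait_for_api("http://localhost:8000/v1/models")` before pointing a load balancer at the container.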
Custom Dockerfile
Build your own Docker image with specific requirements:
Basic Dockerfile
ARG CUDA_VERSION=11.7.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
# Install system dependencies
RUN apt update -y && apt upgrade -y && apt install -y \
git git-lfs python3 python3-pip python3-dev wget vim \
&& rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install
# Create working directory
WORKDIR /workspace
# Install Python dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
RUN pip3 install --no-cache-dir -r requirements.txt
# Install Flash Attention (optional but recommended)
RUN pip3 install flash-attn --no-build-isolation
# Copy application code
COPY openai_api.py .
COPY cli_demo.py .
COPY web_demo.py .
EXPOSE 8000
CMD [ "python" , "openai_api.py" , "-c" , "/models/Qwen-Chat" , "--server-port" , "8000" , "--server-name" , "0.0.0.0" ]
Build and Run Custom Image
# Build the image
docker build -t qwen-custom:latest -f Dockerfile .
# Run the container
docker run --gpus all -d \
--name qwen-api \
-v /path/to/models:/models:ro \
-p 8000:8000 \
qwen-custom:latest \
-c /models/Qwen-7B-Chat
Docker Compose
Manage multi-container deployments with Docker Compose:
docker-compose.yml
version: '3.8'

services:
  qwen-api:
    image: qwenllm/qwen:cu121
    container_name: qwen-api
    restart: always
    ports:
      - "8000:8000"
    volumes:
      - /path/to/models:/models:ro
      - ./logs:/workspace/logs
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: >
      python openai_api.py
      -c /models/Qwen-7B-Chat
      --server-port 8000
      --server-name 0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5m

  nginx:
    image: nginx:alpine
    container_name: qwen-nginx
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      - qwen-api
Launch with Docker Compose
# Start services
docker-compose up -d
# View logs
docker-compose logs -f qwen-api
# Stop services
docker-compose down
# Restart a service
docker-compose restart qwen-api
Container Management
Monitoring
# View container logs
docker logs qwen-api
# Follow logs in real-time
docker logs -f qwen-api
# Check container stats
docker stats qwen-api
# Inspect container
docker inspect qwen-api
Interactive Access
# Open shell in running container
docker exec -it qwen-api bash
# Run Python in container
docker exec -it qwen-api python
# Check GPU status
docker exec qwen-api nvidia-smi
Resource Limits
# Limit CPU and memory
docker run --gpus all -d \
--name qwen-api \
--cpus= "4.0" \
--memory= "16g" \
--memory-swap= "16g" \
-p 8000:8000 \
qwenllm/qwen:cu121
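Docker memory flags accept binary-unit suffixes (`b`, `k`, `m`, `g`, with `k` = 1024 bytes). A small sketch of that conversion, useful when computing limits programmatically; the helper name is ours, not a Docker API:

```python
def docker_mem_to_bytes(value: str) -> int:
    """Convert a Docker memory string ('16g', '512m', '1024k', '42') to bytes.

    Docker interprets the suffixes as binary units: k = 1024, m = 1024**2, g = 1024**3.
    A bare number is taken as bytes.
    """
    units = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
    v = value.strip().lower()
    if v and v[-1] in units:
        return int(float(v[:-1]) * units[v[-1]])
    return int(v)
```

So `--memory="16g"` above grants the container 17,179,869,184 bytes.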
Production Best Practices
Run containers as non-root user
Use read-only filesystem where possible
Scan images for vulnerabilities
Keep base images updated
Use secrets management for sensitive data
docker run --gpus all -d \
--user 1000:1000 \
--read-only \
--tmpfs /tmp \
qwenllm/qwen:cu121
Use custom networks for isolation
Implement reverse proxy (Nginx/Traefik)
Enable TLS/HTTPS
Configure proper firewall rules
docker network create qwen-network
docker run --network qwen-network ...
Use volumes for persistent data
Mount model files as read-only
Implement proper backup strategy
Use volume drivers for distributed storage
docker volume create qwen-models
docker run -v qwen-models:/models:ro ...
Use Docker Swarm or Kubernetes for orchestration
Implement health checks
Configure automatic restart policies
Set up load balancing
Monitor container metrics
Troubleshooting
Error: RuntimeError: No CUDA GPUs are available

Solutions:
Install nvidia-docker2:
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
Verify with: docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
Error: CUDA out of memory

Solutions:
Use quantized models (Int4/Int8)
Increase Docker memory limit
Use multi-GPU deployment
Reduce max sequence length
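A back-of-the-envelope estimate helps choose between these options. This rule-of-thumb sketch counts weight memory only; KV cache, activations, and CUDA overhead come on top, so treat the result as a floor, not a budget:

```python
def estimate_weights_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough lower bound on GPU memory (GB) needed for model weights alone."""
    # params * bits / 8 gives bytes; divide by 1e9 for (decimal) gigabytes.
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

# Qwen-7B weights, roughly:
#   fp16 -> ~14 GB, int8 -> ~7 GB, int4 -> ~3.5 GB
```

This is why an Int4 quant of a 7B model fits comfortably on a 12 GB card while the fp16 weights alone already exceed it.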
Container exits immediately

Error: Permission denied accessing model files

Solution: Fix file permissions:
# On host
sudo chown -R 1000:1000 /path/to/models
# Or run container with current user
docker run --user $(id -u):$(id -g) ...
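Before re-running the container, you can verify the fix from the host. A small sketch, assuming the user running the check has the same uid the container will run as:

```python
import os

def model_dir_readable(path: str) -> bool:
    """Check that the model directory, and every file under it, is readable
    by the current user (the uid the container will run as)."""
    # Need read + execute on the directory itself to list and traverse it.
    if not os.access(path, os.R_OK | os.X_OK):
        return False
    for root, dirs, files in os.walk(path):
        for name in files:
            if not os.access(os.path.join(root, name), os.R_OK):
                return False
    return True
```

Usage: `model_dir_readable("/path/to/models/Qwen-7B-Chat")` should return `True` before you start the container.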
Multi-stage Builds
Reduce image size with multi-stage builds:
# Build stage
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip install --target=/install -r requirements.txt
# Runtime stage
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
# The runtime base image ships no Python; install the interpreter the packages need
RUN apt-get update && apt-get install -y python3 && rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local/lib/python3.10/dist-packages
# Smaller final image: runtime base plus only the installed packages
Layer Caching
Optimize build times:
# Copy requirements first (cached layer)
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy code last (changes frequently)
COPY . .
GPU Memory Management
# Pin the container to a single GPU (this limits which devices are visible, not how much memory they use)
docker run --gpus '"device=0"' \
-e CUDA_VISIBLE_DEVICES=0 \
qwenllm/qwen:cu121
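Note that inside the container, GPUs are renumbered from 0 regardless of which physical device was passed through with `--gpus`. A small sketch of how a CUDA_VISIBLE_DEVICES-style value parses into logical indices (the helper is illustrative, not part of any CUDA API):

```python
def visible_devices(env_value: str) -> list:
    """Parse a CUDA_VISIBLE_DEVICES-style value into logical device indices.

    Inside the container these indices start at 0 no matter which physical
    GPU was selected on the host with --gpus '"device=N"'.
    """
    if env_value.strip() == "":
        return []  # empty value hides all GPUs
    return [int(tok) for tok in env_value.split(",") if tok.strip() != ""]
```

So with `--gpus '"device=3"'` on the host, the container still sees a single GPU as device 0.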
Next Steps
vLLM Deployment Scale up with high-performance vLLM
Kubernetes Deploy on Kubernetes clusters
Production Guide Best practices for production
Monitoring Set up monitoring and alerting