Docker provides a containerized environment for running Qwen models with all dependencies pre-configured. This is the easiest way to get started with production deployments.
Pre-built Docker Images
Qwen provides official Docker images on Docker Hub:
# CUDA 11.7 (default)
docker pull qwenllm/qwen:cu117
# CUDA 11.4
docker pull qwenllm/qwen:cu114
# CUDA 12.1 (latest)
docker pull qwenllm/qwen:cu121
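The tag should not exceed the CUDA version your host driver supports. A small helper sketch for picking a tag, assuming only the three image tags listed above exist (the selection logic itself is hypothetical, not part of the official tooling):

```python
# Map of CUDA versions to the official image tags listed above.
CUDA_TAGS = {
    "11.4": "qwenllm/qwen:cu114",
    "11.7": "qwenllm/qwen:cu117",
    "12.1": "qwenllm/qwen:cu121",
}

def pick_image(host_cuda: str) -> str:
    """Return the newest image tag whose CUDA version does not exceed the host's."""
    # Compare (major, minor) tuples so "12.1" sorts after "11.7".
    host = tuple(int(x) for x in host_cuda.split(".")[:2])
    candidates = [
        (tuple(int(x) for x in version.split(".")), tag)
        for version, tag in CUDA_TAGS.items()
        if tuple(int(x) for x in version.split(".")) <= host
    ]
    if not candidates:
        raise ValueError(f"no image tag supports CUDA {host_cuda}")
    return max(candidates)[1]
```

For example, a host driver reporting CUDA 11.8 would get `qwenllm/qwen:cu117`, the newest tag it can run.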
Quick Start
Web Demo Deployment
Download the Deployment Script
git clone https://github.com/QwenLM/Qwen.git
cd Qwen/docker
Run the Web Demo
bash docker_web_demo.sh \
-c /path/to/Qwen-7B-Chat \
-n qwen-web \
--port 8901
Access the Interface
Open your browser and navigate to http://localhost:8901
OpenAI API Server Deployment
Run the API Server
bash docker_openai_api.sh \
-c /path/to/Qwen-7B-Chat \
-n qwen-api \
--port 8000
Test the API
curl http://localhost:8000/v1/models
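Beyond listing models, you can exercise the chat endpoint directly. A minimal stdlib-only client sketch, assuming the server exposes the standard OpenAI-compatible `/v1/chat/completions` route; the base URL and model name are placeholders for your deployment:

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(base_url: str, model: str, user_message: str) -> str:
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Usage: `chat("http://localhost:8000", "Qwen-7B-Chat", "Hello!")` against the container started above.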
CLI Demo Deployment
bash docker_cli_demo.sh \
-c /path/to/Qwen-7B-Chat \
-n qwen-cli
Manual Docker Commands
Basic Container Launch
docker run --gpus all -it --rm \
--name qwen-chat \
-v /path/to/models:/data/shared/Qwen/models \
-p 8000:8000 \
qwenllm/qwen:cu117 \
python openai_api.py -c /data/shared/Qwen/models/Qwen-7B-Chat --server-port 8000 --server-name 0.0.0.0
Persistent Container
For long-running deployments:
docker run --gpus all -d \
--name qwen-api \
--restart always \
-v /path/to/models:/models:ro \
-p 8000:80 \
qwenllm/qwen:cu117 \
python openai_api.py -c /models/Qwen-7B-Chat --server-port 80 --server-name 0.0.0.0
--gpus all: enable GPU access for the container
-d: run the container in detached mode (in the background)
--restart always: automatically restart the container on failure or system reboot
-v: mount a host directory into the container; append :ro for read-only access
-p: map a host port to a container port (host:container)
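Loading a 7B model can take minutes, so a deployment script should not assume the API is up the moment `docker run` returns. A minimal readiness-poll sketch in Python (stdlib only); the probe URL is an assumption, with `/v1/models` being a convenient endpoint on the OpenAI-compatible server:

```python
import time
import urllib.error
import urllib.request

def wait_for_api(url: str, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll `url` until it answers HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; retry after a short sleep
        time.sleep(interval)
    return False
```

Usage: `wait_for_api("http://localhost:8000/v1/models")` before pointing a load balancer at the container.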
Custom Dockerfile
Build your own Docker image with specific requirements:
Basic Dockerfile
ARG CUDA_VERSION=11.7.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
# Install system dependencies
RUN apt update -y && apt upgrade -y && apt install -y \
git git-lfs python3 python3-pip python3-dev wget vim \
&& rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install
# Create working directory
WORKDIR /workspace
# Install Python dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
RUN pip3 install --no-cache-dir -r requirements.txt
# Install Flash Attention (optional but recommended)
RUN pip3 install flash-attn --no-build-isolation
# Copy application code
COPY openai_api.py .
COPY cli_demo.py .
COPY web_demo.py .
EXPOSE 8000
CMD [ "python" , "openai_api.py" , "-c" , "/models/Qwen-Chat" , "--server-port" , "8000" , "--server-name" , "0.0.0.0" ]
Build and Run Custom Image
# Build the image
docker build -t qwen-custom:latest -f Dockerfile .
# Run the container
docker run --gpus all -d \
--name qwen-api \
-v /path/to/models:/models:ro \
-p 8000:8000 \
qwen-custom:latest \
-c /models/Qwen-7B-Chat
Docker Compose
Manage multi-container deployments with Docker Compose:
docker-compose.yml
version: '3.8'

services:
  qwen-api:
    image: qwenllm/qwen:cu121
    container_name: qwen-api
    restart: always
    ports:
      - "8000:8000"
    volumes:
      - /path/to/models:/models:ro
      - ./logs:/workspace/logs
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: >
      python openai_api.py
      -c /models/Qwen-7B-Chat
      --server-port 8000
      --server-name 0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5m

  nginx:
    image: nginx:alpine
    container_name: qwen-nginx
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      - qwen-api
Launch with Docker Compose
# Start services
docker-compose up -d
# View logs
docker-compose logs -f qwen-api
# Stop services
docker-compose down
# Restart a service
docker-compose restart qwen-api
Container Management
Monitoring
# View container logs
docker logs qwen-api
# Follow logs in real-time
docker logs -f qwen-api
# Check container stats
docker stats qwen-api
# Inspect container
docker inspect qwen-api
Interactive Access
# Open shell in running container
docker exec -it qwen-api bash
# Run Python in container
docker exec -it qwen-api python
# Check GPU status
docker exec qwen-api nvidia-smi
Resource Limits
# Limit CPU and memory
docker run --gpus all -d \
--name qwen-api \
--cpus= "4.0" \
--memory= "16g" \
--memory-swap= "16g" \
-p 8000:8000 \
qwenllm/qwen:cu121
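Docker memory flags accept binary-unit suffixes (`b`, `k`, `m`, `g`, with `k` = 1024 bytes). A small sketch of that conversion, useful when computing limits programmatically; the helper name is ours, not a Docker API:

```python
def docker_mem_to_bytes(value: str) -> int:
    """Convert a Docker memory string ('16g', '512m', '1024k', '42') to bytes.

    Docker interprets the suffixes as binary units: k = 1024, m = 1024**2, g = 1024**3.
    A bare number is taken as bytes.
    """
    units = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
    v = value.strip().lower()
    if v and v[-1] in units:
        return int(float(v[:-1]) * units[v[-1]])
    return int(v)
```

So `--memory="16g"` above grants the container 17,179,869,184 bytes.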
Production Best Practices
Run containers as non-root user
Use read-only filesystem where possible
Scan images for vulnerabilities
Keep base images updated
Use secrets management for sensitive data
docker run --gpus all -d \
--user 1000:1000 \
--read-only \
--tmpfs /tmp \
qwenllm/qwen:cu121
Use custom networks for isolation
Implement reverse proxy (Nginx/Traefik)
Enable TLS/HTTPS
Configure proper firewall rules
docker network create qwen-network
docker run --network qwen-network ...
Use volumes for persistent data
Mount model files as read-only
Implement proper backup strategy
Use volume drivers for distributed storage
docker volume create qwen-models
docker run -v qwen-models:/models:ro ...
Use Docker Swarm or Kubernetes for orchestration
Implement health checks
Configure automatic restart policies
Set up load balancing
Monitor container metrics
Troubleshooting
Error: RuntimeError: No CUDA GPUs are available

Solutions:
Install nvidia-docker2:
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
Verify with: docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
Error: CUDA out of memory

Solutions:
Use quantized models (Int4/Int8)
Increase Docker memory limit
Use multi-GPU deployment
Reduce max sequence length
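A back-of-the-envelope estimate helps choose between these options. This rule-of-thumb sketch counts weight memory only; KV cache, activations, and CUDA overhead come on top, so treat the result as a floor, not a budget:

```python
def estimate_weights_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough lower bound on GPU memory (GB) needed for model weights alone."""
    # params * bits / 8 gives bytes; divide by 1e9 for (decimal) gigabytes.
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

# Qwen-7B weights, roughly:
#   fp16 -> ~14 GB, int8 -> ~7 GB, int4 -> ~3.5 GB
```

This is why an Int4 quant of a 7B model fits comfortably on a 12 GB card while the fp16 weights alone already exceed it.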
Container exits immediately

Error: Permission denied accessing model files

Solution: Fix file permissions:
# On host
sudo chown -R 1000:1000 /path/to/models
# Or run container with current user
docker run --user $(id -u):$(id -g) ...
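Before re-running the container, you can verify the fix from the host. A small sketch, assuming the user running the check has the same uid the container will run as:

```python
import os

def model_dir_readable(path: str) -> bool:
    """Check that the model directory, and every file under it, is readable
    by the current user (the uid the container will run as)."""
    # Need read + execute on the directory itself to list and traverse it.
    if not os.access(path, os.R_OK | os.X_OK):
        return False
    for root, dirs, files in os.walk(path):
        for name in files:
            if not os.access(os.path.join(root, name), os.R_OK):
                return False
    return True
```

Usage: `model_dir_readable("/path/to/models/Qwen-7B-Chat")` should return `True` before you start the container.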
Multi-stage Builds
Reduce image size with multi-stage builds:
# Build stage
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip install --target=/install -r requirements.txt
# Runtime stage
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
# The runtime base image ships no Python; install the interpreter the packages need
RUN apt-get update && apt-get install -y python3 && rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local/lib/python3.10/dist-packages
# Smaller final image: runtime base plus only the installed packages
Layer Caching
Optimize build times:
# Copy requirements first (cached layer)
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy code last (changes frequently)
COPY . .
GPU Memory Management
# Pin the container to a single GPU (this limits which devices are visible, not how much memory they use)
docker run --gpus '"device=0"' \
-e CUDA_VISIBLE_DEVICES=0 \
qwenllm/qwen:cu121
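Note that inside the container, GPUs are renumbered from 0 regardless of which physical device was passed through with `--gpus`. A small sketch of how a CUDA_VISIBLE_DEVICES-style value parses into logical indices (the helper is illustrative, not part of any CUDA API):

```python
def visible_devices(env_value: str) -> list:
    """Parse a CUDA_VISIBLE_DEVICES-style value into logical device indices.

    Inside the container these indices start at 0 no matter which physical
    GPU was selected on the host with --gpus '"device=N"'.
    """
    if env_value.strip() == "":
        return []  # empty value hides all GPUs
    return [int(tok) for tok in env_value.split(",") if tok.strip() != ""]
```

So with `--gpus '"device=3"'` on the host, the container still sees a single GPU as device 0.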
Next Steps
vLLM Deployment Scale up with high-performance vLLM
Kubernetes Deploy on Kubernetes clusters
Production Guide Best practices for production
Monitoring Set up monitoring and alerting