Docker provides a containerized environment for running Qwen models with all dependencies pre-configured. This is the easiest way to get started with production deployments.

Pre-built Docker Images

Qwen provides official Docker images on Docker Hub:
# CUDA 11.7 (default)
docker pull qwenllm/qwen:cu117

# CUDA 11.4
docker pull qwenllm/qwen:cu114

# CUDA 12.1 (latest)
docker pull qwenllm/qwen:cu121
Choose the image that matches your NVIDIA driver version. Check compatibility at NVIDIA CUDA Compatibility.
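The driver-to-tag mapping can be sketched as a small shell helper. The version thresholds below are the approximate Linux driver minimums NVIDIA publishes for each CUDA release (CUDA 12.1 needs roughly driver 530+, CUDA 11.7 needs 515+, CUDA 11.4 needs 470+); verify against the compatibility page for your platform before relying on them:

```shell
# Pick a qwenllm/qwen image tag from the NVIDIA driver's major version.
# Thresholds are approximate Linux minimums per CUDA release; check
# NVIDIA's compatibility matrix for your exact driver and OS.
pick_qwen_tag() {
  local driver_major=$1
  if [ "$driver_major" -ge 530 ]; then
    echo "qwenllm/qwen:cu121"
  elif [ "$driver_major" -ge 515 ]; then
    echo "qwenllm/qwen:cu117"
  else
    echo "qwenllm/qwen:cu114"
  fi
}

# Query the installed driver (e.g. "535.104.05" -> 535) if available
if command -v nvidia-smi >/dev/null 2>&1; then
  driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
  pick_qwen_tag "${driver%%.*}"
fi
```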

Quick Start

Web Demo Deployment

1. Download the Deployment Script

git clone https://github.com/QwenLM/Qwen.git
cd Qwen/docker
2. Run the Web Demo

bash docker_web_demo.sh \
  -c /path/to/Qwen-7B-Chat \
  -n qwen-web \
  --port 8901
3. Access the Interface

Open your browser and navigate to http://localhost:8901

OpenAI API Server Deployment

1. Run the API Server

bash docker_openai_api.sh \
  -c /path/to/Qwen-7B-Chat \
  -n qwen-api \
  --port 8000
2. Test the API

curl http://localhost:8000/v1/models
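Listing models confirms the server is up; to exercise inference, post to the chat-completions endpoint. A minimal sketch, assuming the server from the previous step is listening on port 8000 (the model name and prompt are placeholders for whatever you mounted):

```shell
# Build a chat-completion request body for the OpenAI-compatible endpoint.
BODY='{
  "model": "Qwen-7B-Chat",
  "messages": [{"role": "user", "content": "Hello, who are you?"}]
}'
echo "$BODY"

# Send it once the container is serving:
# curl http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$BODY"
```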

CLI Demo Deployment

bash docker_cli_demo.sh \
  -c /path/to/Qwen-7B-Chat \
  -n qwen-cli

Manual Docker Commands

Basic Container Launch

docker run --gpus all -it --rm \
  --name qwen-chat \
  -v /path/to/models:/data/shared/Qwen/models \
  -p 8000:8000 \
  qwenllm/qwen:cu117 \
  python openai_api.py -c /data/shared/Qwen/models/Qwen-7B-Chat --server-port 8000 --server-name 0.0.0.0

Persistent Container

For long-running deployments:
docker run --gpus all -d \
  --name qwen-api \
  --restart always \
  -v /path/to/models:/models:ro \
  -p 8000:80 \
  qwenllm/qwen:cu117 \
  python openai_api.py -c /models/Qwen-7B-Chat --server-port 80 --server-name 0.0.0.0
Flag reference:
  • --gpus all (required): Enable GPU access for the container
  • -d: Run the container in detached mode (background)
  • --restart always: Automatically restart the container on failure or system reboot
  • -v: Mount a host directory into the container; append :ro for read-only access
  • -p: Map a container port to a host port (host:container)

Custom Dockerfile

Build your own Docker image with specific requirements:
ARG CUDA_VERSION=11.7.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git git-lfs python3 python3-pip python3-dev wget vim \
    && rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/bin/python3 /usr/bin/python
RUN git lfs install

# Create working directory
WORKDIR /workspace

# Install Python dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Flash Attention (optional but recommended)
RUN pip3 install flash-attn --no-build-isolation

# Copy application code
COPY openai_api.py .
COPY cli_demo.py .
COPY web_demo.py .

EXPOSE 8000

CMD ["python", "openai_api.py", "-c", "/models/Qwen-Chat", "--server-port", "8000", "--server-name", "0.0.0.0"]

Build and Run Custom Image

# Build the image
docker build -t qwen-custom:latest -f Dockerfile .

# Run the container (arguments after the image name replace CMD entirely,
# so pass the full command when pointing at a different model path)
docker run --gpus all -d \
  --name qwen-api \
  -v /path/to/models:/models:ro \
  -p 8000:8000 \
  qwen-custom:latest \
  python openai_api.py -c /models/Qwen-7B-Chat --server-port 8000 --server-name 0.0.0.0

Docker Compose

Manage multi-container deployments with Docker Compose:
version: '3.8'

services:
  qwen-api:
    image: qwenllm/qwen:cu121
    container_name: qwen-api
    restart: always
    ports:
      - "8000:8000"
    volumes:
      - /path/to/models:/models:ro
      - ./logs:/workspace/logs
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: >
      python openai_api.py
      -c /models/Qwen-7B-Chat
      --server-port 8000
      --server-name 0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5m

  nginx:
    image: nginx:alpine
    container_name: qwen-nginx
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      - qwen-api

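The nginx service above mounts a ./nginx.conf that this guide does not show. A minimal reverse-proxy sketch, assuming plain HTTP on the default compose network (TLS and the ssl mount are left out; the upstream name qwen-api matches the compose service name):

```nginx
events {}

http {
    server {
        listen 80;

        location / {
            # Forward all traffic to the qwen-api service on the compose network
            proxy_pass http://qwen-api:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            # Long generation requests: disable buffering, extend timeout
            proxy_buffering off;
            proxy_read_timeout 300s;
        }
    }
}
```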
Launch with Docker Compose

# Start services
docker-compose up -d

# View logs
docker-compose logs -f qwen-api

# Stop services
docker-compose down

# Restart a service
docker-compose restart qwen-api

Container Management

Monitoring

# View container logs
docker logs qwen-api

# Follow logs in real-time
docker logs -f qwen-api

# Check container stats
docker stats qwen-api

# Inspect container
docker inspect qwen-api

Interactive Access

# Open shell in running container
docker exec -it qwen-api bash

# Run Python in container
docker exec -it qwen-api python

# Check GPU status
docker exec qwen-api nvidia-smi

Resource Limits

# Limit CPU and memory
docker run --gpus all -d \
  --name qwen-api \
  --cpus="4.0" \
  --memory="16g" \
  --memory-swap="16g" \
  -p 8000:8000 \
  qwenllm/qwen:cu121

Production Best Practices

Security
  • Run containers as a non-root user
  • Use a read-only filesystem where possible
  • Scan images for vulnerabilities
  • Keep base images updated
  • Use secrets management for sensitive data
docker run --gpus all -d \
  --user 1000:1000 \
  --read-only \
  --tmpfs /tmp \
  qwenllm/qwen:cu121

Networking
  • Use custom networks for isolation
  • Implement a reverse proxy (Nginx/Traefik)
  • Enable TLS/HTTPS
  • Configure proper firewall rules
docker network create qwen-network
docker run --network qwen-network ...

Storage
  • Use volumes for persistent data
  • Mount model files as read-only
  • Implement a proper backup strategy
  • Use volume drivers for distributed storage
docker volume create qwen-models
docker run -v qwen-models:/models:ro ...

Orchestration
  • Use Docker Swarm or Kubernetes for orchestration
  • Implement health checks
  • Configure automatic restart policies
  • Set up load balancing
  • Monitor container metrics

Troubleshooting

Error: RuntimeError: No CUDA GPUs are available

Solutions:
  • Install nvidia-docker2:
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y nvidia-docker2
    sudo systemctl restart docker
    
  • Verify with: docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
Error: CUDA out of memory

Solutions:
  • Use quantized models (Int4/Int8)
  • Increase Docker memory limit
  • Use multi-GPU deployment
  • Reduce max sequence length
Issue: Container stops right after starting

Debug steps:
# Check logs
docker logs qwen-api

# Run in interactive mode
docker run --gpus all -it --rm qwenllm/qwen:cu121 bash

# Test model loading
docker run --gpus all -it --rm \
  -v /path/to/models:/models \
  qwenllm/qwen:cu121 \
  python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('/models/Qwen-7B-Chat', trust_remote_code=True)"
Error: Permission denied accessing model files

Solution: Fix file permissions:
# On host
sudo chown -R 1000:1000 /path/to/models

# Or run container with current user
docker run --user $(id -u):$(id -g) ...

Performance Optimization

Multi-stage Builds

Reduce image size with multi-stage builds:
# Build stage
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip install --target=/install -r requirements.txt

# Runtime stage
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
COPY --from=builder /install /usr/local/lib/python3.10/dist-packages
# Smaller final image

Layer Caching

Optimize build times:
# Copy requirements first (cached layer)
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy code last (changes frequently)
COPY . .

GPU Memory Management

# Restrict the container to a single GPU (Docker selects devices;
# it does not cap GPU memory itself)
docker run --gpus '"device=0"' \
  -e CUDA_VISIBLE_DEVICES=0 \
  qwenllm/qwen:cu121

Next Steps

vLLM Deployment

Scale up with high-performance vLLM

Kubernetes

Deploy on Kubernetes clusters

Production Guide

Best practices for production

Monitoring

Set up monitoring and alerting
