Mini-SGLang provides official Docker support for easy deployment in containerized environments. This guide covers building images, running containers, and best practices for production deployments.

Prerequisites

1. Install Docker

Follow the official Docker installation guide for your platform.
2. Install NVIDIA Container Toolkit

Required for GPU access in containers:
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify installation:
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi

Building the Docker Image

1. Clone the repository

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
2. Build the image

Build with default settings:
docker build -t minisgl .
Or customize build arguments:
docker build -t minisgl \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg PYTHON_VERSION=3.12 \
  --build-arg UBUNTU_VERSION=24.04 \
  .
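Build arguments like these are consumed by ARG instructions in the Dockerfile. A minimal sketch of how such arguments are typically wired up (hypothetical; consult the repository's Dockerfile for the actual definitions):

```dockerfile
# Hypothetical sketch: ARGs declared before FROM are only visible to FROM,
# so ARGs needed later must be re-declared after the FROM line.
ARG CUDA_VERSION=12.8.1
ARG UBUNTU_VERSION=24.04
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}

ARG PYTHON_VERSION=3.12
RUN apt-get update && apt-get install -y python${PYTHON_VERSION}
```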
3. Verify the build

Check that the image was created:
docker images minisgl
Expected output:
REPOSITORY   TAG       IMAGE ID       CREATED         SIZE
minisgl      latest    abc123def456   2 minutes ago   8.5GB

Running the Container

Basic Server Deployment

Launch an API server with GPU access:
docker run --gpus all -p 1919:1919 \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
The --host 0.0.0.0 flag is required for the server to be accessible outside the container.
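Once the server is up, you can verify it from the host by querying the OpenAI-compatible models endpoint (assuming the default port mapping above and a running container):

```shell
# Lists the models served by the container
curl http://localhost:1919/v1/models
```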

Interactive Shell Mode

Run in interactive shell mode:
docker run -it --gpus all \
  minisgl --model Qwen/Qwen3-0.6B --shell

Custom Port Mapping

Map to a different host port:
docker run --gpus all -p 8000:1919 \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
Access at http://localhost:8000

Using Volume Mounts

Persistent Cache Directories

Use Docker volumes to cache downloaded models and compiled kernels:
docker run --gpus all -p 1919:1919 \
  -v huggingface_cache:/app/.cache/huggingface \
  -v tvm_cache:/app/.cache/tvm-ffi \
  -v flashinfer_cache:/app/.cache/flashinfer \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
Using volume mounts significantly speeds up subsequent container starts by avoiding re-downloading models and re-compiling kernels.

Using Host Directories

Alternatively, mount specific host directories:
mkdir -p ~/.cache/minisgl/{huggingface,tvm-ffi,flashinfer}

docker run --gpus all -p 1919:1919 \
  -v ~/.cache/minisgl/huggingface:/app/.cache/huggingface \
  -v ~/.cache/minisgl/tvm-ffi:/app/.cache/tvm-ffi \
  -v ~/.cache/minisgl/flashinfer:/app/.cache/flashinfer \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0

Multi-GPU Deployment

All Available GPUs

docker run --gpus all -p 1919:1919 \
  -v huggingface_cache:/app/.cache/huggingface \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0

Specific GPUs

Select specific GPU devices:
docker run --gpus '"device=0,1,2,3"' -p 1919:1919 \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0
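Tensor-parallel workers often communicate through shared memory, and Docker's default /dev/shm is only 64 MB. If NCCL reports shared-memory errors during multi-GPU runs, enlarging it is a common remedy (the 16g value below is an illustrative choice, not a Mini-SGLang requirement):

```shell
docker run --gpus all --shm-size=16g -p 1919:1919 \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0
```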

Production Deployment

Docker Compose

Create a docker-compose.yml:
services:
  minisgl:
    image: minisgl:latest
    command: --model Qwen/Qwen3-0.6B --host 0.0.0.0
    ports:
      - "1919:1919"
    volumes:
      - huggingface_cache:/app/.cache/huggingface
      - tvm_cache:/app/.cache/tvm-ffi
      - flashinfer_cache:/app/.cache/flashinfer
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  huggingface_cache:
  tvm_cache:
  flashinfer_cache:
Launch with:
docker compose up -d
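Standard Compose subcommands then manage the running service:

```shell
docker compose logs -f minisgl   # follow server logs
docker compose restart minisgl   # restart after a configuration change
docker compose down              # stop and remove the container (named volumes persist)
```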

Health Checks

Add a health check to the Dockerfile or docker-compose.yml:
services:
  minisgl:
    # ... other configuration ...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1919/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
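The probe passes as long as /v1/models responds successfully. A small Python sketch of the kind of check the curl probe performs, assuming the response follows the OpenAI list-models format:

```python
import json

def server_healthy(body: str) -> bool:
    """Treat the server as healthy if /v1/models returns a non-empty model list."""
    payload = json.loads(body)
    return payload.get("object") == "list" and len(payload.get("data", [])) > 0

# Example response body in the OpenAI list-models format
sample = '{"object": "list", "data": [{"id": "Qwen/Qwen3-0.6B", "object": "model"}]}'
print(server_healthy(sample))  # True
```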

Resource Limits

Set memory and CPU limits:
docker run --gpus all -p 1919:1919 \
  --memory="32g" \
  --cpus="8" \
  -v huggingface_cache:/app/.cache/huggingface \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0

Environment Variables

Pass environment variables to configure the container:
docker run --gpus all -p 1919:1919 \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e HF_TOKEN=your_huggingface_token \
  minisgl --model meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0

Available Environment Variables

CUDA_VISIBLE_DEVICES (string): comma-separated GPU indices to use.
HF_TOKEN (string): HuggingFace authentication token for gated models.
MINISGL_DISABLE_OVERLAP_SCHEDULING (boolean): set to 1 to disable overlap scheduling.
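Boolean flags like MINISGL_DISABLE_OVERLAP_SCHEDULING follow the common set-to-1 convention. A hypothetical helper illustrating how such a flag is typically read (the actual parsing inside Mini-SGLang may differ):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag ("1" means enabled)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip() == "1"

os.environ["MINISGL_DISABLE_OVERLAP_SCHEDULING"] = "1"
print(env_flag("MINISGL_DISABLE_OVERLAP_SCHEDULING"))  # True
```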

Troubleshooting

GPU not accessible in the container: verify the NVIDIA Container Toolkit is installed:
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi
If this fails, reinstall the NVIDIA Container Toolkit.

Permission errors on mounted volumes: the container runs as a non-root user (UID 1001). Ensure mounted directories have the correct ownership:
sudo chown -R 1001:1001 ~/.cache/minisgl

Out-of-memory errors: increase Docker’s memory limit:
# In Docker Desktop: Settings → Resources → Memory
# Or use the --memory flag
docker run --gpus all --memory="64g" -p 1919:1919 minisgl --model ...

Model download failures:
  • Check network connectivity
  • Try using --model-source modelscope
  • For gated models, provide the HF_TOKEN environment variable

Windows (WSL2) Deployment

For Windows users with WSL2:
1. Install WSL2 and Docker Desktop

2. Build and run in WSL2

Open WSL2 terminal and follow the standard Linux instructions:
docker build -t minisgl .
docker run --gpus all -p 1919:1919 minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
3. Access from Windows

The server will be accessible at http://localhost:1919 from Windows browsers and applications.
For production deployments with load balancing and orchestration, consider using Kubernetes with NVIDIA GPU Operator.
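As a starting point for such a setup, a minimal (hypothetical) Kubernetes Deployment fragment requesting one GPU through the device plugin that the GPU Operator installs:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minisgl
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minisgl
  template:
    metadata:
      labels:
        app: minisgl
    spec:
      containers:
        - name: minisgl
          image: minisgl:latest
          args: ["--model", "Qwen/Qwen3-0.6B", "--host", "0.0.0.0"]
          ports:
            - containerPort: 1919
          resources:
            limits:
              nvidia.com/gpu: 1
```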
