Mini-SGLang provides official Docker support for easy deployment in containerized environments. This guide covers building images, running containers, and best practices for production deployments.

Prerequisites

1. Install Docker

Follow the official Docker installation guide for your platform.
2. Install NVIDIA Container Toolkit

Required for GPU access in containers:
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify installation:
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi

Building the Docker Image

1. Clone the repository

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
2. Build the image

Build with default settings:
docker build -t minisgl .
Or customize build arguments:
docker build -t minisgl \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg PYTHON_VERSION=3.12 \
  --build-arg UBUNTU_VERSION=24.04 \
  .
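Build arguments like these are consumed by ARG instructions in the Dockerfile. A minimal sketch of how such arguments are typically wired up (hypothetical; consult the repository's Dockerfile for the actual definitions):

```dockerfile
# Hypothetical sketch: ARGs declared before FROM are only visible to FROM,
# so ARGs needed later must be re-declared after the FROM line.
ARG CUDA_VERSION=12.8.1
ARG UBUNTU_VERSION=24.04
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}

ARG PYTHON_VERSION=3.12
RUN apt-get update && apt-get install -y python${PYTHON_VERSION}
```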
3. Verify the build

Check that the image was created:
docker images minisgl
Expected output:
REPOSITORY   TAG       IMAGE ID       CREATED         SIZE
minisgl      latest    abc123def456   2 minutes ago   8.5GB

Running the Container

Basic Server Deployment

Launch an API server with GPU access:
docker run --gpus all -p 1919:1919 \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
The --host 0.0.0.0 flag is required for the server to be accessible outside the container.
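Once the server is up, you can verify it from the host by querying the OpenAI-compatible models endpoint (assuming the default port mapping above and a running container):

```shell
# Lists the models served by the container
curl http://localhost:1919/v1/models
```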

Interactive Shell Mode

Run in interactive shell mode:
docker run -it --gpus all \
  minisgl --model Qwen/Qwen3-0.6B --shell

Custom Port Mapping

Map to a different host port:
docker run --gpus all -p 8000:1919 \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
Access at http://localhost:8000

Using Volume Mounts

Persistent Cache Directories

Use Docker volumes to cache downloaded models and compiled kernels:
docker run --gpus all -p 1919:1919 \
  -v huggingface_cache:/app/.cache/huggingface \
  -v tvm_cache:/app/.cache/tvm-ffi \
  -v flashinfer_cache:/app/.cache/flashinfer \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
Using volume mounts significantly speeds up subsequent container starts by avoiding re-downloading models and re-compiling kernels.

Using Host Directories

Alternatively, mount specific host directories:
mkdir -p ~/.cache/minisgl/{huggingface,tvm-ffi,flashinfer}

docker run --gpus all -p 1919:1919 \
  -v ~/.cache/minisgl/huggingface:/app/.cache/huggingface \
  -v ~/.cache/minisgl/tvm-ffi:/app/.cache/tvm-ffi \
  -v ~/.cache/minisgl/flashinfer:/app/.cache/flashinfer \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0

Multi-GPU Deployment

All Available GPUs

docker run --gpus all -p 1919:1919 \
  -v huggingface_cache:/app/.cache/huggingface \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0

Specific GPUs

Select specific GPU devices:
docker run --gpus '"device=0,1,2,3"' -p 1919:1919 \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0
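Tensor-parallel workers often communicate through shared memory, and Docker's default /dev/shm is only 64 MB. If NCCL reports shared-memory errors during multi-GPU runs, enlarging it is a common remedy (the 16g value below is an illustrative choice, not a Mini-SGLang requirement):

```shell
docker run --gpus all --shm-size=16g -p 1919:1919 \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0
```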

Production Deployment

Docker Compose

Create a docker-compose.yml:
services:
  minisgl:
    image: minisgl:latest
    command: --model Qwen/Qwen3-0.6B --host 0.0.0.0
    ports:
      - "1919:1919"
    volumes:
      - huggingface_cache:/app/.cache/huggingface
      - tvm_cache:/app/.cache/tvm-ffi
      - flashinfer_cache:/app/.cache/flashinfer
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  huggingface_cache:
  tvm_cache:
  flashinfer_cache:
Launch with:
docker compose up -d
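Standard Compose subcommands then manage the running service:

```shell
docker compose logs -f minisgl   # follow server logs
docker compose restart minisgl   # restart after a configuration change
docker compose down              # stop and remove the container (named volumes persist)
```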

Health Checks

Add a health check to the Dockerfile or docker-compose.yml:
services:
  minisgl:
    # ... other configuration ...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1919/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
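The probe passes as long as /v1/models responds successfully. A small Python sketch of the kind of check the curl probe performs, assuming the response follows the OpenAI list-models format:

```python
import json

def server_healthy(body: str) -> bool:
    """Treat the server as healthy if /v1/models returns a non-empty model list."""
    payload = json.loads(body)
    return payload.get("object") == "list" and len(payload.get("data", [])) > 0

# Example response body in the OpenAI list-models format
sample = '{"object": "list", "data": [{"id": "Qwen/Qwen3-0.6B", "object": "model"}]}'
print(server_healthy(sample))  # True
```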

Resource Limits

Set memory and CPU limits:
docker run --gpus all -p 1919:1919 \
  --memory="32g" \
  --cpus="8" \
  -v huggingface_cache:/app/.cache/huggingface \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0

Environment Variables

Pass environment variables to configure the container:
docker run --gpus all -p 1919:1919 \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e HF_TOKEN=your_huggingface_token \
  minisgl --model meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0

Available Environment Variables

CUDA_VISIBLE_DEVICES (string): comma-separated GPU indices to use.
HF_TOKEN (string): HuggingFace authentication token for gated models.
MINISGL_DISABLE_OVERLAP_SCHEDULING (boolean): set to 1 to disable overlap scheduling.
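Boolean flags like MINISGL_DISABLE_OVERLAP_SCHEDULING follow the common set-to-1 convention. A hypothetical helper illustrating how such a flag is typically read (the actual parsing inside Mini-SGLang may differ):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag ("1" means enabled)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip() == "1"

os.environ["MINISGL_DISABLE_OVERLAP_SCHEDULING"] = "1"
print(env_flag("MINISGL_DISABLE_OVERLAP_SCHEDULING"))  # True
```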

Troubleshooting

GPU not accessible in the container: verify the NVIDIA Container Toolkit is installed:
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi
If this fails, reinstall the NVIDIA Container Toolkit.

Permission errors on mounted volumes: the container runs as a non-root user (UID 1001). Ensure mounted directories have the correct ownership:
sudo chown -R 1001:1001 ~/.cache/minisgl

Out-of-memory errors: increase Docker’s memory limit:
# In Docker Desktop: Settings → Resources → Memory
# Or use the --memory flag
docker run --gpus all --memory="64g" -p 1919:1919 minisgl --model ...

Model download failures:
  • Check network connectivity
  • Try using --model-source modelscope
  • For gated models, provide the HF_TOKEN environment variable

Windows (WSL2) Deployment

For Windows users with WSL2:
1. Install WSL2 and Docker Desktop

2. Build and run in WSL2

Open WSL2 terminal and follow the standard Linux instructions:
docker build -t minisgl .
docker run --gpus all -p 1919:1919 minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
3. Access from Windows

The server will be accessible at http://localhost:1919 from Windows browsers and applications.
For production deployments with load balancing and orchestration, consider using Kubernetes with NVIDIA GPU Operator.
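As a starting point for such a setup, a minimal (hypothetical) Kubernetes Deployment fragment requesting one GPU through the device plugin that the GPU Operator installs:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minisgl
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minisgl
  template:
    metadata:
      labels:
        app: minisgl
    spec:
      containers:
        - name: minisgl
          image: minisgl:latest
          args: ["--model", "Qwen/Qwen3-0.6B", "--host", "0.0.0.0"]
          ports:
            - containerPort: 1919
          resources:
            limits:
              nvidia.com/gpu: 1
```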
