Mini-SGLang provides official Docker support for easy deployment in containerized environments. This guide covers building images, running containers, and best practices for production deployments.
Prerequisites
Install NVIDIA Container Toolkit
Required for GPU access in containers:

```shell
# Ubuntu/Debian
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
Verify installation:

```shell
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi
```
Building the Docker Image
Clone the repository
```shell
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
```
Build the image
Build with default settings:

```shell
docker build -t minisgl .
```

Or customize build arguments:

```shell
docker build -t minisgl \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg PYTHON_VERSION=3.12 \
  --build-arg UBUNTU_VERSION=24.04 \
  .
```
Verify the build
Check that the image was created:

```shell
docker images minisgl
```

Expected output:

```
REPOSITORY   TAG      IMAGE ID       CREATED         SIZE
minisgl      latest   abc123def456   2 minutes ago   8.5GB
```
Running the Container
Basic Server Deployment
Launch an API server with GPU access:
```shell
docker run --gpus all -p 1919:1919 \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
The --host 0.0.0.0 flag is required for the server to be accessible outside the container.
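With the container running, the API can be exercised from the host. The sketch below builds a request for what is assumed to be an OpenAI-compatible /v1/chat/completions endpoint (the healthcheck path /v1/models later in this guide suggests that API surface, but the exact route and payload schema are assumptions, not confirmed here):

```python
import json
from urllib import request

# Adjust to match your -p host:container mapping.
BASE_URL = "http://localhost:1919"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload (assumed schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def send(payload: dict) -> dict:
    """POST the payload to the server; requires a running container."""
    req = request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Qwen/Qwen3-0.6B", "Hello!")
print(json.dumps(payload, indent=2))
# send(payload)  # uncomment once the container is up
```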
Interactive Shell Mode
Run in interactive shell mode:
```shell
docker run -it --gpus all \
  minisgl --model Qwen/Qwen3-0.6B --shell
```
Custom Port Mapping
Map to a different host port:
```shell
docker run --gpus all -p 8000:1919 \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
Access at http://localhost:8000
Using Volume Mounts
Persistent Cache Directories
Use Docker volumes to cache downloaded models and compiled kernels:
```shell
docker run --gpus all -p 1919:1919 \
  -v huggingface_cache:/app/.cache/huggingface \
  -v tvm_cache:/app/.cache/tvm-ffi \
  -v flashinfer_cache:/app/.cache/flashinfer \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
Using volume mounts significantly speeds up subsequent container starts by avoiding re-downloading models and re-compiling kernels.
Using Host Directories
Alternatively, mount specific host directories:
```shell
mkdir -p ~/.cache/minisgl/{huggingface,tvm-ffi,flashinfer}

docker run --gpus all -p 1919:1919 \
  -v ~/.cache/minisgl/huggingface:/app/.cache/huggingface \
  -v ~/.cache/minisgl/tvm-ffi:/app/.cache/tvm-ffi \
  -v ~/.cache/minisgl/flashinfer:/app/.cache/flashinfer \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
Multi-GPU Deployment
All Available GPUs
```shell
docker run --gpus all -p 1919:1919 \
  -v huggingface_cache:/app/.cache/huggingface \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0
```
Specific GPUs
Select specific GPU devices:
```shell
docker run --gpus '"device=0,1,2,3"' -p 1919:1919 \
  minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --host 0.0.0.0
```
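Tensor parallelism (`--tp`) shards the model weights evenly across the selected GPUs. A back-of-envelope sketch of why a 70B model needs 4 GPUs (decimal GB, weights only — KV cache, activations, and runtime buffers add to this):

```python
def weights_per_gpu_gb(n_params: float, bytes_per_param: int = 2, tp: int = 1) -> float:
    """Rough per-GPU memory for model weights alone, assuming even
    sharding across `tp` tensor-parallel ranks (2 bytes/param = bf16)."""
    return n_params * bytes_per_param / tp / 1e9

# Llama-3.1-70B in bf16 across 4 GPUs:
print(f"{weights_per_gpu_gb(70e9, 2, 4):.0f} GB per GPU")  # 35 GB per GPU
```

At ~35 GB of weights per rank, four 80 GB GPUs leave comfortable headroom for the KV cache, whereas a single GPU could not hold the weights at all.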
Production Deployment
Docker Compose
Create a docker-compose.yml:
```yaml
version: '3.8'

services:
  minisgl:
    image: minisgl:latest
    command: --model Qwen/Qwen3-0.6B --host 0.0.0.0
    ports:
      - "1919:1919"
    volumes:
      - huggingface_cache:/app/.cache/huggingface
      - tvm_cache:/app/.cache/tvm-ffi
      - flashinfer_cache:/app/.cache/flashinfer
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  huggingface_cache:
  tvm_cache:
  flashinfer_cache:
```
Launch with:

```shell
docker compose up -d
```
Health Checks
Add a health check to the Dockerfile or docker-compose.yml:
```yaml
services:
  minisgl:
    # ... other configuration ...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1919/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
```
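Outside of Compose, the same wait-for-ready logic is often needed in deployment scripts. A minimal sketch (the poller and its stub probe are illustrative, not part of Mini-SGLang; in practice the probe would issue the same GET /v1/models request as the healthcheck):

```python
import time
from typing import Callable

def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 120.0,
                     interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.
    Mirrors the compose healthcheck: probe ~= GET /v1/models succeeding."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False

# Example with a stub probe that succeeds on the third attempt:
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts),
                         timeout_s=10, interval_s=0, sleep=lambda s: None)
print(ready)  # True
```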
Resource Limits
Set memory and CPU limits:
```shell
docker run --gpus all -p 1919:1919 \
  --memory="32g" \
  --cpus="8" \
  -v huggingface_cache:/app/.cache/huggingface \
  minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
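Note that Docker's `--memory` suffixes (`b`, `k`, `m`, `g`) are interpreted as binary multiples (so `32g` is 32 × 1024³ bytes). A small sketch for sanity-checking a limit against a model's expected footprint (the parser is illustrative, not a Docker API):

```python
def docker_mem_to_bytes(spec: str) -> int:
    """Parse a Docker --memory value like '32g' into bytes.
    Docker treats k/m/g suffixes as 1024-based multiples."""
    units = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
    spec = spec.strip().lower()
    if spec[-1] in units:
        return int(float(spec[:-1]) * units[spec[-1]])
    return int(spec)  # bare number = bytes

print(docker_mem_to_bytes("32g"))  # 34359738368
```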
Environment Variables
Pass environment variables to configure the container:
```shell
docker run --gpus all -p 1919:1919 \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e HF_TOKEN=your_huggingface_token \
  minisgl --model meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0
```
Available Environment Variables
| Variable | Description |
| --- | --- |
| CUDA_VISIBLE_DEVICES | Comma-separated GPU indices to use |
| HF_TOKEN | HuggingFace authentication token for gated models |
| MINISGL_DISABLE_OVERLAP_SCHEDULING | Set to 1 to disable overlap scheduling |
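When launching containers from a script, it is convenient to expand a dict of these variables into the repeated `-e` flags `docker run` expects. A small illustrative helper (not part of Mini-SGLang):

```python
def env_flags(env: dict) -> list:
    """Expand {KEY: VALUE, ...} into ['-e', 'KEY=VALUE', ...] arguments
    suitable for building a `docker run` command line."""
    flags = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags

print(env_flags({"CUDA_VISIBLE_DEVICES": "0,1",
                 "MINISGL_DISABLE_OVERLAP_SCHEDULING": "1"}))
```

The resulting list can be passed straight to `subprocess.run(["docker", "run", *flags, ...])`.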
Troubleshooting
Verify NVIDIA Container Toolkit is installed:

```shell
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi
```

If this fails, reinstall the NVIDIA Container Toolkit.
The container runs as a non-root user (UID 1001). Ensure mounted volumes have correct permissions:

```shell
sudo chown -R 1001:1001 ~/.cache/minisgl
```
Increase Docker’s memory limit:

```shell
# In Docker Desktop: Settings → Resources → Memory
# Or use the --memory flag:
docker run --gpus all --memory="64g" -p 1919:1919 minisgl --model ...
```
If model downloads fail:

- Check network connectivity
- Try using --model-source modelscope
- For gated models, provide the HF_TOKEN environment variable
Windows (WSL2) Deployment
For Windows users with WSL2:
Install WSL2 and Docker Desktop
Build and run in WSL2
Open a WSL2 terminal and follow the standard Linux instructions:

```shell
docker build -t minisgl .
docker run --gpus all -p 1919:1919 minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
Access from Windows
The server will be accessible at http://localhost:1919 from Windows browsers and applications.
For production deployments with load balancing and orchestration, consider using Kubernetes with NVIDIA GPU Operator.