Installation

This guide covers all installation methods for Mini-SGLang, including platform-specific instructions and prerequisites.
Platform Support: Mini-SGLang currently supports Linux only (x86_64 and aarch64). Windows and macOS are not supported due to dependencies on Linux-specific CUDA kernels (sgl-kernel, flashinfer).

Prerequisites

Before installing Mini-SGLang, ensure you have the following:

System Requirements

  • Operating System: Linux (x86_64 or aarch64)
  • Python: Version 3.10 or higher (3.12 recommended)
  • GPU: NVIDIA GPU with CUDA support
  • CUDA Toolkit: Required for JIT-compilation of CUDA kernels
CUDA Toolkit Version: Mini-SGLang relies on CUDA kernels that are JIT-compiled. Ensure your CUDA Toolkit version matches your driver’s CUDA capability. Check your driver version with nvidia-smi.

Verify CUDA Installation

Check your CUDA driver version:
nvidia-smi
Ensure the NVIDIA CUDA Toolkit is installed and accessible. You can download it from NVIDIA’s website.

Installation Methods

Method 1: Install with uv (Recommended)

We recommend using uv for fast and reliable installation. Note that uv does not conflict with conda.
1. Install uv (if not already installed)

curl -LsSf https://astral.sh/uv/install.sh | sh
2. Create a virtual environment

uv venv --python=3.12
source .venv/bin/activate
3. Clone and install Mini-SGLang

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
4. Verify installation

python -m minisgl --help

Method 2: Install with pip

If you prefer using standard Python tools:
1. Create a virtual environment

python3.12 -m venv .venv
source .venv/bin/activate
2. Clone and install Mini-SGLang

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
pip install -e .
3. Verify installation

python -m minisgl --help

Method 3: Docker Installation

Docker provides a consistent environment and is especially useful for cross-platform compatibility.

Prerequisites for Docker

Install Docker along with the NVIDIA Container Toolkit, which containers need in order to access the GPU (the --gpus all flag below depends on it).

1. Clone the repository

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
2. Build the Docker image

docker build -t minisgl .
3. Run the server

docker run --gpus all -p 1919:1919 \
    minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
The server will be accessible at http://localhost:1919.
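To smoke-test the running container, you can send it a request. The endpoint below is an assumption: it presumes Mini-SGLang exposes an OpenAI-compatible /v1/chat/completions route (the openai client is listed under Dependencies for testing), so verify the path for your version:

```shell
# Assumed endpoint: OpenAI-compatible /v1/chat/completions (verify for your version)
curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```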

Docker Options

# Run the server on the default port 1919
docker run --gpus all -p 1919:1919 \
    minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
Mounting Docker volumes for the model and kernel caches significantly speeds up subsequent container starts by persisting downloaded models and compiled kernels.
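A sketch of such a cache-persisting run; the container-side path below is an assumption, so check where your image actually stores its Hugging Face and kernel caches:

```shell
# Persist downloaded models (and, where applicable, compiled kernels) across runs.
# The container-side cache path is an assumption; adjust it for your image.
docker run --gpus all -p 1919:1919 \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```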

Platform-Specific Instructions

Windows (WSL2)

Since Mini-SGLang requires Linux-specific dependencies, Windows users should use WSL2:
1. Install WSL2

Open PowerShell as Administrator and run:
wsl --install
Restart your computer when prompted.
2. Install CUDA on WSL2

Follow NVIDIA’s WSL2 CUDA guide to install CUDA support.
Ensure your Windows GPU drivers support WSL2. You can check this by running nvidia-smi inside WSL2.
3. Install Mini-SGLang in WSL2

Inside the WSL2 terminal:
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
uv pip install -e .
4. Access from Windows

Once the server is running inside WSL2, it is accessible at http://localhost:1919 from Windows browsers and applications.

macOS

macOS is not supported due to dependencies on Linux-specific CUDA kernels, and Docker on macOS cannot access NVIDIA GPUs. macOS users should run Mini-SGLang on a Linux machine with an NVIDIA GPU, for example a remote host or a Linux VM with GPU passthrough.

Dependencies

Mini-SGLang automatically installs the following core dependencies:
  • torch (<2.10.0): PyTorch for tensor operations
  • transformers (>=4.56.0, <=4.57.3): Hugging Face transformers
  • flashinfer-python (>=0.5.3): FlashInfer attention backend
  • sgl_kernel (>=0.3.17.post1): Custom CUDA kernels
  • apache-tvm-ffi (>=0.1.4): Python binding and JIT interface for kernels
  • fastapi: API server framework
  • uvicorn: ASGI server
  • pyzmq: ZeroMQ for inter-process communication
  • accelerate: Hugging Face accelerate library
  • modelscope: Alternative model source (useful in China)
  • openai: OpenAI client for testing
  • prompt_toolkit: Interactive shell interface
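As a quick sanity check, the installed versions can be compared against the ranges above. A minimal, standard-library-only sketch (it compares plain numeric version segments and ignores pre-release semantics, so treat it as approximate):

```python
from importlib import metadata

def version_tuple(version):
    """Turn '4.57.3' into (4, 57, 3); non-numeric suffixes like 'post1' keep their digits."""
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def in_range(installed, lower=None, upper=None, upper_inclusive=True):
    """Check lower <= installed <= upper (or < upper), comparing numeric tuples."""
    v = version_tuple(installed)
    if lower is not None and v < version_tuple(lower):
        return False
    if upper is not None:
        u = version_tuple(upper)
        return v <= u if upper_inclusive else v < u
    return True

def check(package, **bounds):
    """Report whether an installed package satisfies a documented range."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "OK" if in_range(installed, **bounds) else "OUT OF RANGE"
    return f"{package} {installed}: {status}"

# Ranges as documented above:
print(check("torch", upper="2.10.0", upper_inclusive=False))
print(check("transformers", lower="4.56.0", upper="4.57.3"))
print(check("flashinfer-python", lower="0.5.3"))
```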

Development Dependencies

To install development dependencies for testing and contributing:
pip install -e ".[dev]"
This includes:
  • pytest and pytest-cov for testing
  • black, ruff, and flake8 for code formatting and linting
  • mypy for type checking
  • matplotlib for benchmarking visualizations

Configuration Options

Mini-SGLang uses command-line arguments for configuration. View all available options:
python -m minisgl --help

Common Options

  • --model: HuggingFace model name (e.g., “Qwen/Qwen3-0.6B”)
  • --tp: Tensor parallelism degree (number of GPUs)
  • --port: Server port (default: 1919)
  • --host: Server host (default: 127.0.0.1)
  • --shell: Launch interactive shell mode
  • --model-source: Model source (“huggingface” or “modelscope”)
  • --max-prefill-length: Maximum prefill chunk size
  • --page-size: KV cache page size
  • --attn: Attention backend (e.g., “fa,fi” for FlashAttention prefill and FlashInfer decode)
  • --cache: Cache management strategy (“radix” or “naive”)
  • --cuda-graph-max-bs: Maximum batch size for CUDA graph capture (0 to disable)
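Putting several of these flags together, an illustrative (not prescriptive) launch might look like:

```shell
# Illustrative invocation combining common options; values are examples only
python -m minisgl \
    --model Qwen/Qwen3-0.6B \
    --tp 1 \
    --host 0.0.0.0 --port 1919 \
    --cache radix \
    --attn fa,fi
```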

Troubleshooting

CUDA Toolkit Issues

If you see errors about missing CUDA toolkit:
  1. Verify CUDA toolkit is installed:
    nvcc --version
    
  2. Ensure CUDA toolkit version matches your driver:
    nvidia-smi
    
  3. Add CUDA to your PATH if needed:
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    

Import Errors

If you encounter import errors for sgl_kernel or flashinfer:
  • These packages require Linux and CUDA support
  • Ensure you’re on a Linux system with NVIDIA GPU
  • Verify CUDA toolkit is properly installed
  • Try reinstalling: pip install --force-reinstall sgl_kernel flashinfer-python

Model Download Issues

If you have trouble downloading models from HuggingFace:
# Use ModelScope instead (especially useful in China)
python -m minisgl --model "Qwen/Qwen3-0.6B" --model-source modelscope

Out of Memory Errors

If you encounter OOM errors:
  1. Use a smaller model for testing (e.g., Qwen3-0.6B)
  2. Reduce max prefill length: --max-prefill-length 2048
  3. Adjust page size: --page-size 16
  4. Reduce CUDA graph batch size: --cuda-graph-max-bs 32
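These knobs mostly trade KV-cache memory for throughput, so it helps to estimate how much memory the cache needs per token. A back-of-the-envelope sketch (the model dimensions below are illustrative, not taken from any specific model's config):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, dtype_bytes=2):
    """KV-cache size: 2 tensors (K and V) per layer, per token, per kv-head, per head dim."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Illustrative dimensions (hypothetical, not any specific model's config):
per_token = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=128, num_tokens=1)
print(f"{per_token / 1024:.0f} KiB per token")                                # 112 KiB
print(f"{kv_cache_bytes(28, 8, 128, 4096) / 2**20:.0f} MiB for 4096 tokens")  # 448 MiB
```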

Next Steps

  • Quick Start: Get up and running in under 5 minutes
  • Features: Explore all features and configuration options
  • System Architecture: Understand the design and data flow
  • Benchmarks: See performance comparisons
