Installation

This guide covers all installation methods for Mini-SGLang, including platform-specific instructions and prerequisites.
Platform Support: Mini-SGLang currently supports Linux only (x86_64 and aarch64). Windows and macOS are not supported due to dependencies on Linux-specific CUDA kernels (sgl-kernel, flashinfer).

Prerequisites

Before installing Mini-SGLang, ensure you have the following:

System Requirements

  • Operating System: Linux (x86_64 or aarch64)
  • Python: Version 3.10 or higher (3.12 recommended)
  • GPU: NVIDIA GPU with CUDA support
  • CUDA Toolkit: Required for JIT-compilation of CUDA kernels
CUDA Toolkit Version: Mini-SGLang relies on CUDA kernels that are JIT-compiled. Ensure your CUDA Toolkit version matches your driver’s CUDA capability. Check your driver version with nvidia-smi.

Verify CUDA Installation

Check your CUDA driver version:
nvidia-smi
Ensure the NVIDIA CUDA Toolkit is installed and accessible. You can download it from NVIDIA’s website.

Installation Methods

Method 1: Install with uv (Recommended)

We recommend using uv for fast and reliable installation. Note that uv does not conflict with conda.
1. Install uv (if not already installed)

curl -LsSf https://astral.sh/uv/install.sh | sh
2. Create a virtual environment

uv venv --python=3.12
source .venv/bin/activate
3. Clone and install Mini-SGLang

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
4. Verify installation

python -m minisgl --help

Method 2: Install with pip

If you prefer using standard Python tools:
1. Create a virtual environment

python3.12 -m venv .venv
source .venv/bin/activate
2. Clone and install Mini-SGLang

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
pip install -e .
3. Verify installation

python -m minisgl --help

Method 3: Docker Installation

Docker provides a consistent environment and is especially useful for cross-platform compatibility.

Prerequisites for Docker

Install Docker along with the NVIDIA Container Toolkit, which containers need in order to access the GPU (the --gpus all flag below depends on it).

1. Clone the repository

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
2. Build the Docker image

docker build -t minisgl .
3. Run the server

docker run --gpus all -p 1919:1919 \
    minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
The server will be accessible at http://localhost:1919.
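To smoke-test the running container, you can send it a request. The endpoint below is an assumption: it presumes Mini-SGLang exposes an OpenAI-compatible /v1/chat/completions route (the openai client is listed under Dependencies for testing), so verify the path for your version:

```shell
# Assumed endpoint: OpenAI-compatible /v1/chat/completions (verify for your version)
curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```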

Docker Options

# Run the server on the default port 1919
docker run --gpus all -p 1919:1919 \
    minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
Mounting Docker volumes for the model and kernel caches significantly speeds up subsequent container starts by persisting downloaded models and compiled kernels.
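A sketch of such a cache-persisting run; the container-side path below is an assumption, so check where your image actually stores its Hugging Face and kernel caches:

```shell
# Persist downloaded models (and, where applicable, compiled kernels) across runs.
# The container-side cache path is an assumption; adjust it for your image.
docker run --gpus all -p 1919:1919 \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```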

Platform-Specific Instructions

Windows (WSL2)

Since Mini-SGLang requires Linux-specific dependencies, Windows users should use WSL2:
1. Install WSL2

Open PowerShell as Administrator and run:
wsl --install
Restart your computer when prompted.
2. Install CUDA on WSL2

Follow NVIDIA’s WSL2 CUDA guide to install CUDA support.
Ensure your Windows GPU drivers support WSL2. You can check this by running nvidia-smi inside WSL2.
3. Install Mini-SGLang in WSL2

Inside the WSL2 terminal:
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
uv pip install -e .
4. Access from Windows

Once the server is running inside WSL2, it is accessible at http://localhost:1919 from Windows browsers and applications.

macOS

macOS is not supported due to dependencies on Linux-specific CUDA kernels, and Docker on macOS cannot access NVIDIA GPUs. macOS users should run Mini-SGLang on a Linux machine with an NVIDIA GPU, for example a remote host or a Linux VM with GPU passthrough.

Dependencies

Mini-SGLang automatically installs the following core dependencies:
  • torch (<2.10.0): PyTorch for tensor operations
  • transformers (>=4.56.0, <=4.57.3): Hugging Face transformers
  • flashinfer-python (>=0.5.3): FlashInfer attention backend
  • sgl_kernel (>=0.3.17.post1): Custom CUDA kernels
  • apache-tvm-ffi (>=0.1.4): Python binding and JIT interface for kernels
  • fastapi: API server framework
  • uvicorn: ASGI server
  • pyzmq: ZeroMQ for inter-process communication
  • accelerate: Hugging Face accelerate library
  • modelscope: Alternative model source (useful in China)
  • openai: OpenAI client for testing
  • prompt_toolkit: Interactive shell interface
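As a quick sanity check, the installed versions can be compared against the ranges above. A minimal, standard-library-only sketch (it compares plain numeric version segments and ignores pre-release semantics, so treat it as approximate):

```python
from importlib import metadata

def version_tuple(version):
    """Turn '4.57.3' into (4, 57, 3); non-numeric suffixes like 'post1' keep their digits."""
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def in_range(installed, lower=None, upper=None, upper_inclusive=True):
    """Check lower <= installed <= upper (or < upper), comparing numeric tuples."""
    v = version_tuple(installed)
    if lower is not None and v < version_tuple(lower):
        return False
    if upper is not None:
        u = version_tuple(upper)
        return v <= u if upper_inclusive else v < u
    return True

def check(package, **bounds):
    """Report whether an installed package satisfies a documented range."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "OK" if in_range(installed, **bounds) else "OUT OF RANGE"
    return f"{package} {installed}: {status}"

# Ranges as documented above:
print(check("torch", upper="2.10.0", upper_inclusive=False))
print(check("transformers", lower="4.56.0", upper="4.57.3"))
print(check("flashinfer-python", lower="0.5.3"))
```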

Development Dependencies

To install development dependencies for testing and contributing:
pip install -e ".[dev]"
This includes:
  • pytest and pytest-cov for testing
  • black, ruff, and flake8 for code formatting and linting
  • mypy for type checking
  • matplotlib for benchmarking visualizations

Configuration Options

Mini-SGLang uses command-line arguments for configuration. View all available options:
python -m minisgl --help

Common Options

  • --model: HuggingFace model name (e.g., “Qwen/Qwen3-0.6B”)
  • --tp: Tensor parallelism degree (number of GPUs)
  • --port: Server port (default: 1919)
  • --host: Server host (default: 127.0.0.1)
  • --shell: Launch interactive shell mode
  • --model-source: Model source (“huggingface” or “modelscope”)
  • --max-prefill-length: Maximum prefill chunk size
  • --page-size: KV cache page size
  • --attn: Attention backend (e.g., “fa,fi” for FlashAttention prefill and FlashInfer decode)
  • --cache: Cache management strategy (“radix” or “naive”)
  • --cuda-graph-max-bs: Maximum batch size for CUDA graph capture (0 to disable)
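Putting several of these flags together, an illustrative (not prescriptive) launch might look like:

```shell
# Illustrative invocation combining common options; values are examples only
python -m minisgl \
    --model Qwen/Qwen3-0.6B \
    --tp 1 \
    --host 0.0.0.0 --port 1919 \
    --cache radix \
    --attn fa,fi
```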

Troubleshooting

CUDA Toolkit Issues

If you see errors about missing CUDA toolkit:
  1. Verify CUDA toolkit is installed:
    nvcc --version
    
  2. Ensure CUDA toolkit version matches your driver:
    nvidia-smi
    
  3. Add CUDA to your PATH if needed:
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    

Import Errors

If you encounter import errors for sgl_kernel or flashinfer:
  • These packages require Linux and CUDA support
  • Ensure you’re on a Linux system with NVIDIA GPU
  • Verify CUDA toolkit is properly installed
  • Try reinstalling: pip install --force-reinstall sgl_kernel flashinfer-python

Model Download Issues

If you have trouble downloading models from HuggingFace:
# Use ModelScope instead (especially useful in China)
python -m minisgl --model "Qwen/Qwen3-0.6B" --model-source modelscope

Out of Memory Errors

If you encounter OOM errors:
  1. Use a smaller model for testing (e.g., Qwen3-0.6B)
  2. Reduce max prefill length: --max-prefill-length 2048
  3. Adjust page size: --page-size 16
  4. Reduce CUDA graph batch size: --cuda-graph-max-bs 32
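These knobs mostly trade KV-cache memory for throughput, so it helps to estimate how much memory the cache needs per token. A back-of-the-envelope sketch (the model dimensions below are illustrative, not taken from any specific model's config):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, dtype_bytes=2):
    """KV-cache size: 2 tensors (K and V) per layer, per token, per kv-head, per head dim."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Illustrative dimensions (hypothetical, not any specific model's config):
per_token = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=128, num_tokens=1)
print(f"{per_token / 1024:.0f} KiB per token")                                # 112 KiB
print(f"{kv_cache_bytes(28, 8, 128, 4096) / 2**20:.0f} MiB for 4096 tokens")  # 448 MiB
```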

Next Steps

  • Quick Start: Get up and running in under 5 minutes
  • Features: Explore all features and configuration options
  • System Architecture: Understand the design and data flow
  • Benchmarks: See performance comparisons
