Installation
This guide covers all installation methods for Mini-SGLang, including platform-specific instructions and prerequisites.

Prerequisites
Before installing Mini-SGLang, ensure you have the following:

System Requirements
- Operating System: Linux (x86_64 or aarch64)
- Python: Version 3.10 or higher (3.12 recommended)
- GPU: NVIDIA GPU with CUDA support
- CUDA Toolkit: Required for JIT-compilation of CUDA kernels
CUDA Toolkit Version: Mini-SGLang relies on CUDA kernels that are JIT-compiled. Ensure your CUDA Toolkit version matches the CUDA version your driver supports; check the driver version with nvidia-smi.

Verify CUDA Installation
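A quick way to verify both driver and toolkit, assuming a standard CUDA Toolkit install:

```shell
# Driver version and the highest CUDA version the driver supports
nvidia-smi
# Toolkit compiler used for JIT-compiling the CUDA kernels
nvcc --version
```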
The first line of nvidia-smi output reports the driver version and the maximum CUDA version it supports.

Installation Methods
Method 1: Install with uv (Recommended)
We recommend using uv for fast and reliable installation. Note that uv does not conflict with conda.
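A minimal sketch; the PyPI package name mini-sglang is an assumption, so check the project README for the actual install target:

```shell
# Create and activate a virtual environment on Python 3.12
uv venv --python 3.12
source .venv/bin/activate
# Package name is an assumption; substitute the real one if it differs
uv pip install mini-sglang
```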
Method 2: Install with pip
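A minimal sketch with plain pip (the mini-sglang package name is an assumption; check the project README):

```shell
python -m venv .venv
source .venv/bin/activate
pip install mini-sglang  # hypothetical package name; see the project README
```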
Use this method if you prefer standard Python tools.

Method 3: Docker Installation
Docker provides a consistent environment and is especially useful for cross-platform compatibility.

Prerequisites for Docker
- Docker installed
- NVIDIA Container Toolkit installed
Docker Options
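As a sketch, assuming an image tagged mini-sglang:latest is available (the image name is hypothetical; the port and model flags follow the server defaults documented below):

```shell
# Image name is hypothetical; --gpus requires the NVIDIA Container Toolkit
docker run --gpus all --rm -p 1919:1919 \
  mini-sglang:latest \
  --model Qwen/Qwen3-0.6B --port 1919
```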
Platform-Specific Instructions
Windows (WSL2)
Since Mini-SGLang requires Linux-specific dependencies, Windows users should use WSL2.

Install CUDA on WSL2
Follow NVIDIA’s WSL2 CUDA guide to install CUDA support.
Ensure your Windows GPU drivers support WSL2; you can check by running nvidia-smi inside WSL2.

macOS
Mini-SGLang requires Linux and an NVIDIA GPU with CUDA, so it does not run natively on macOS; use a remote Linux machine, or Docker on a Linux host, instead.
Dependencies
Mini-SGLang automatically installs the following core dependencies:
- torch (<2.10.0): PyTorch for tensor operations
- transformers (>=4.56.0, <=4.57.3): Hugging Face transformers
- flashinfer-python (>=0.5.3): FlashInfer attention backend
- sgl_kernel (>=0.3.17.post1): Custom CUDA kernels
- apache-tvm-ffi (>=0.1.4): Python binding and JIT interface for kernels
- fastapi: API server framework
- uvicorn: ASGI server
- pyzmq: ZeroMQ for inter-process communication
- accelerate: Hugging Face accelerate library
- modelscope: Alternative model source (useful in China)
- openai: OpenAI client for testing
- prompt_toolkit: Interactive shell interface
Development Dependencies
To install development dependencies for testing and contributing:
- pytest and pytest-cov for testing
- black, ruff, and flake8 for code formatting and linting
- mypy for type checking
- matplotlib for benchmarking visualizations
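If the repository defines a development extra (the extra name dev here is an assumption), an editable install from a checkout pulls these in:

```shell
pip install -e ".[dev]"  # "dev" extra name is an assumption
```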
Configuration Options
Mini-SGLang uses command-line arguments for configuration. View all available options by passing --help to the launcher.

Common Options
- --model: HuggingFace model name (e.g., “Qwen/Qwen3-0.6B”)
- --tp: Tensor parallelism degree (number of GPUs)
- --port: Server port (default: 1919)
- --host: Server host (default: 127.0.0.1)
- --shell: Launch interactive shell mode
- --model-source: Model source (“huggingface” or “modelscope”)
- --max-prefill-length: Maximum prefill chunk size
- --page-size: KV cache page size
- --attn: Attention backend (e.g., “fa,fi” for FlashAttention prefill and FlashInfer decode)
- --cache: Cache management strategy (“radix” or “naive”)
- --cuda-graph-max-bs: Maximum batch size for CUDA graph capture (0 to disable)
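Putting a few of these together, a sketch of a server launch (the python -m mini_sglang entry point is an assumption; the flags are the ones documented above):

```shell
# Entry point is an assumption; flags match the options listed above
python -m mini_sglang \
  --model Qwen/Qwen3-0.6B \
  --tp 1 \
  --host 127.0.0.1 --port 1919 \
  --attn fa,fi \
  --cache radix
```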
Troubleshooting
CUDA Toolkit Issues
If you see errors about missing CUDA toolkit:
- Verify the CUDA toolkit is installed.
- Ensure the CUDA toolkit version matches your driver.
- Add CUDA to your PATH if needed.
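For example, assuming a default install under /usr/local/cuda:

```shell
# 1. Verify the toolkit is installed
nvcc --version
# 2. Compare against the CUDA version the driver supports
nvidia-smi
# 3. Put the toolkit on PATH if the shell cannot find nvcc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```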
Import Errors
If you encounter import errors for sgl_kernel or flashinfer:
- These packages require Linux and CUDA support
- Ensure you’re on a Linux system with NVIDIA GPU
- Verify CUDA toolkit is properly installed
- Try reinstalling: pip install --force-reinstall sgl_kernel flashinfer-python
Model Download Issues
If you have trouble downloading models from HuggingFace, try switching the model source to ModelScope with --model-source modelscope (useful in China).

Out of Memory Errors
If you encounter OOM errors:
- Use a smaller model for testing (e.g., Qwen3-0.6B)
- Reduce max prefill length: --max-prefill-length 2048
- Adjust page size: --page-size 16
- Reduce CUDA graph batch size: --cuda-graph-max-bs 32
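Applied together (the python -m mini_sglang entry point is an assumption; the flags are documented under Common Options):

```shell
python -m mini_sglang --model Qwen/Qwen3-0.6B \
  --max-prefill-length 2048 --page-size 16 --cuda-graph-max-bs 32
```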
Next Steps
Quick Start
Get up and running in under 5 minutes
Features
Explore all features and configuration options
System Architecture
Understand the design and data flow
Benchmarks
See performance comparisons