Get Mini-SGLang up and running in less than 5 minutes. This guide shows you the fastest way to install, launch a server, and make your first inference request.
Platform Requirements: Mini-SGLang supports Linux only (x86_64 and aarch64). On Windows, use WSL2. macOS is not supported because Mini-SGLang depends on Linux-specific CUDA kernels.

Installation and First Run

1. Install Mini-SGLang

Clone the repository and install with uv (Python 3.10+ required):
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
uv pip install -e .
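Prerequisites: Ensure the NVIDIA CUDA Toolkit is installed and that its version is compatible with your driver. nvidia-smi shows the driver version and the highest CUDA version that driver supports; nvcc --version shows the installed toolkit version.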
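Before installing, it can help to confirm the basics in one place. The following is a minimal sketch, not part of Mini-SGLang: the function name `check_environment` is hypothetical, and it only checks that the interpreter meets the Python 3.10+ requirement, that the platform is Linux, and that `uv` and `nvidia-smi` are on your PATH. It does not verify that the toolkit and driver versions are compatible.

```python
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return simple pre-install checks as a dict of booleans (illustrative helper)."""
    return {
        # Mini-SGLang requires Python 3.10 or newer.
        "python_ok": sys.version_info >= min_python,
        # Linux is the only supported platform (WSL2 also reports "linux").
        "is_linux": sys.platform.startswith("linux"),
        # The install steps above use uv.
        "uv_found": shutil.which("uv") is not None,
        # Presence of nvidia-smi suggests an NVIDIA driver is installed.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'not satisfied'}")
```

Any line reported as "not satisfied" points at a prerequisite to fix before running the install commands.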
2. Launch the server

Start an OpenAI-compatible API server with a single command:
python -m minisgl --model "Qwen/Qwen3-0.6B"
The server will start on http://localhost:1919 by default. You’ll see output indicating the server is ready:
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1919 (Press CTRL+C to quit)
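If you launch the server from a script, you may want to wait until it accepts connections before sending requests. The sketch below is one way to do that with the standard library, assuming the default port 1919 shown above. The helper name `wait_for_port` is hypothetical, and a successful TCP connection only shows that the port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=1919, timeout=60.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: give up after the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` with no arguments checks `localhost:1919` for up to 60 seconds and returns `True` once the port is reachable.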
3. Make a test request

Send a chat completion request using curl:
curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
    "stream": false
  }'
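The same request can be sent from Python. This is a minimal sketch using only the standard library; it mirrors the curl command above (same URL, header, and JSON body). The helper names `build_chat_request` and `send` are illustrative and not part of Mini-SGLang.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1919"  # default address from the launch step


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", max_tokens=50, stream=False):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (requires the server from the launch step to be running):
# print(send(build_chat_request("What is the capital of France?")))
```

Building the request separately from sending it lets you inspect the payload without a running server.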
4. View the response

The server replies in OpenAI-compatible format. The example below shows the chunk shape, with the generated text under choices[0].delta.content:
{
  "id": "cmpl-0",
  "object": "text_completion.chunk",
  "choices": [
    {
      "delta": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "index": 0,
      "finish_reason": null
    }
  ]
}
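To pull the generated text out of a response like the one above, you can read `content` from each entry in `choices`. The sketch below assumes the shape shown here (`delta`) and, as a fallback, the `message` field that OpenAI-style non-streaming responses use. Whether Mini-SGLang emits `message` is an assumption, and `extract_text` is a hypothetical helper.

```python
import json


def extract_text(body):
    """Return the concatenated content of all choices in a response body."""
    data = json.loads(body) if isinstance(body, str) else body
    parts = []
    for choice in data.get("choices", []):
        # The example above uses "delta"; "message" is an assumed fallback.
        piece = choice.get("delta") or choice.get("message") or {}
        parts.append(piece.get("content") or "")
    return "".join(parts)


sample = {
    "id": "cmpl-0",
    "object": "text_completion.chunk",
    "choices": [
        {
            "delta": {"role": "assistant", "content": "The capital of France is Paris."},
            "index": 0,
            "finish_reason": None,
        }
    ],
}
print(extract_text(sample))  # prints: The capital of France is Paris.
```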

Alternative: Interactive Shell

For quick testing and exploration, launch the interactive shell mode:
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
Type your prompts directly and get real-time responses. Use /reset to clear chat history or /exit to quit.

Quick Examples

Request a streamed response by setting "stream" to true:
curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 100,
    "stream": true
  }'
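A streamed response has to be read piece by piece. The sketch below assumes the server follows the usual OpenAI streaming convention: server-sent event lines of the form `data: {...}`, ending with `data: [DONE]`. That framing is an assumption about Mini-SGLang, not something this guide confirms, and `iter_stream_text` is a hypothetical helper.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream marker (OpenAI convention)
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Usage with the standard library (requires a running server):
# import urllib.request
# req = urllib.request.Request(url, data=payload_bytes,
#                              headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     for text in iter_stream_text(resp):
#         print(text, end="", flush=True)
```

Because the function accepts any iterable of lines, it works directly on the response object returned by `urllib.request.urlopen`, which yields one line of bytes at a time.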

Next Steps

Now that you have Mini-SGLang running, explore more of its capabilities.
Tip: If you encounter network issues downloading models from HuggingFace, use --model-source modelscope to download from ModelScope instead.
