Quickstart

Get Mini-SGLang up and running in less than 5 minutes. This guide shows you the fastest way to install, launch a server, and make your first inference request.

Platform Requirements: Mini-SGLang supports Linux only (x86_64 and aarch64). Windows users should run it under WSL2. macOS is not supported because Mini-SGLang depends on Linux-specific CUDA kernels.

Installation and First Run

Install Mini-SGLang

Clone the repository and install with uv (Python 3.10+ required):

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
uv pip install -e .

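Prerequisites: Ensure you have the NVIDIA CUDA Toolkit installed and that its version is compatible with your driver. nvidia-smi reports the highest CUDA version your driver supports, and nvcc --version reports the version of the installed toolkit.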
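The version check above can be scripted. The sketch below is only an illustration: the function names are hypothetical, and it assumes the usual header line printed by nvidia-smi ("CUDA Version: X.Y") and the usual "release X.Y" line printed by nvcc --version. It applies a conservative rule of thumb (toolkit no newer than the version the driver reports); CUDA's minor-version compatibility may allow more than this check accepts.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def driver_cuda_version(nvidia_smi_output: str) -> Optional[Version]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header, if present."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_cuda_version(nvcc_output: str) -> Optional[Version]:
    """Extract the 'release X.Y' field from the output of `nvcc --version`, if present."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_supported(driver: Version, toolkit: Version) -> bool:
    """Conservative rule of thumb (an assumption of this sketch):
    the toolkit should not be newer than the CUDA version the driver reports."""
    return toolkit <= driver
```

To feed it real data, capture the tool output with subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout and do the same for ["nvcc", "--version"].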

Launch the server

Start an OpenAI-compatible API server with a single command:

python -m minisgl --model "Qwen/Qwen3-0.6B"

The server will start on http://localhost:1919 by default. You’ll see output indicating the server is ready:

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1919 (Press CTRL+C to quit)
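If you start the server from a script, you may want to wait until it accepts connections before sending requests. The sketch below polls the default port from this guide (1919) with plain TCP connects; wait_for_port is a hypothetical helper, not part of Mini-SGLang. An open port only shows that the HTTP server is listening, so treat it as a rough readiness signal.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 1919, timeout: float = 60.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False if `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not listening yet (refused or timed out): pause briefly and retry.
            time.sleep(0.2)
    return False
```

For example, call wait_for_port(timeout=120) after launching python -m minisgl in the background, and only proceed when it returns True.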

Make a test request

Send a chat completion request using curl:

curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
    "stream": false
  }'

View the response

The server replies in an OpenAI-compatible format. A response looks like this:

{
  "id": "cmpl-0",
  "object": "text_completion.chunk",
  "choices": [
    {
      "delta": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "index": 0,
      "finish_reason": null
    }
  ]
}
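The same request can be sent from Python with only the standard library. This is a minimal sketch, not an official client: build_payload, extract_text, and chat are hypothetical names, the URL is the default from this guide, and the code assumes the body of a non-streaming reply is a single JSON object shaped like the sample above. extract_text reads "delta" as shown in that sample and falls back to "message", the field the OpenAI chat format uses for non-streaming replies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1919"  # default address from this guide


def build_payload(prompt: str, model: str = "Qwen/Qwen3-0.6B", max_tokens: int = 50) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }


def extract_text(response: dict) -> str:
    """Join the text of all choices, reading "delta" first and "message" as a fallback."""
    parts = []
    for choice in response.get("choices", []):
        body = choice.get("delta") or choice.get("message") or {}
        parts.append(body.get("content") or "")
    return "".join(parts)


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """POST a chat completion request and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return extract_text(json.load(reply))
```

With the server running, print(chat("What is the capital of France?")) sends the request from the curl example and prints the extracted text.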

Alternative: Interactive Shell

For quick testing and exploration, launch the interactive shell mode:

python -m minisgl --model "Qwen/Qwen3-0.6B" --shell

Type your prompts directly and get real-time responses. Use /reset to clear chat history or /exit to quit.

Quick Examples

To stream tokens as they are generated, set "stream" to true:

curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 100,
    "stream": true
  }'
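A streaming reply has to be consumed incrementally. The sketch below assumes the server follows the OpenAI streaming convention: server-sent events where each "data:" line carries one JSON chunk and "data: [DONE]" ends the stream. That framing is an assumption here, not something this guide states, and iter_stream_content is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text pieces carried by "data:" lines until "[DONE]" is seen."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker in the OpenAI convention
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                yield piece
```

The response object returned by urllib.request.urlopen can be iterated line by line, so it can be passed to iter_stream_content directly and each piece printed as it arrives.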

Next Steps

Now that you have Mini-SGLang running, explore more capabilities:

Installation Guide - Detailed installation options including Docker and WSL2
Server Configuration - Configure advanced options like Tensor Parallelism and attention backends
API Reference - Complete OpenAI-compatible API documentation
Core Concepts - Learn about Radix Cache, Chunked Prefill, and other optimizations

If you encounter network issues downloading models from HuggingFace, use --model-source modelscope to download from ModelScope instead.
