Mini-SGLang provides an OpenAI-compatible API server that allows you to deploy large language models and integrate them with existing tools and clients.

Launching the API Server

Step 1: Start the server

Launch the API server with a single command. The server will be available on port 1919 by default.
python -m minisgl --model "Qwen/Qwen3-0.6B"
To use a custom port:
python -m minisgl --model "Qwen/Qwen3-0.6B" --port 30000
Step 2: Wait for server initialization

The server will download the model (if not already cached) and compile necessary kernels. You should see output indicating the server is ready:
API server is ready to serve on 0.0.0.0:1919
Step 3: Verify the server is running

Check available models:
curl http://localhost:1919/v1/models
Expected output:
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-0.6B",
      "object": "model",
      "created": 1709510400,
      "owned_by": "mini-sglang",
      "root": "Qwen/Qwen3-0.6B"
    }
  ]
}
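Rather than watching the log by eye, a startup script can poll /v1/models until the server answers. A minimal sketch using only the Python standard library (the base URL, timeout, and helper names are illustrative, not part of Mini-SGLang):

```python
import json
import time
import urllib.request
from urllib.error import URLError


def extract_model_ids(payload: dict) -> list:
    """Pull the model IDs out of a /v1/models response body."""
    return [model["id"] for model in payload.get("data", [])]


def wait_until_ready(base_url: str = "http://localhost:1919",
                     timeout: float = 120.0, interval: float = 2.0) -> list:
    """Poll /v1/models until the server responds; return the served model IDs."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return extract_model_ids(json.load(resp))
        except (URLError, OSError):
            time.sleep(interval)  # server may still be downloading weights or compiling kernels
    raise TimeoutError(f"{base_url} not ready after {timeout}s")
```

With the server from Step 1 running, wait_until_ready() would return a list such as ["Qwen/Qwen3-0.6B"].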

Sending Requests

Chat Completions

The /v1/chat/completions endpoint accepts OpenAI-compatible requests. The example below sets "stream": true, so the response arrives as Server-Sent Events (see Response Format below):
curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": true
  }'
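With "stream": false (or omitted), the endpoint returns a single JSON object instead of an event stream. A standard-library sketch of sending such a request and extracting the reply; it assumes the response follows the usual OpenAI chat-completion shape (choices[0].message.content), which is an assumption here rather than something the output above confirms:

```python
import json
import urllib.request


def extract_reply(payload: dict) -> str:
    """Assumes the OpenAI-style response shape: choices[0].message.content."""
    return payload["choices"][0]["message"]["content"]


def chat_once(base_url: str, model: str, prompt: str, max_tokens: int = 100) -> str:
    """Send a non-streaming chat completion and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```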

Supported Parameters

model (string, required): Model identifier; must match the deployed model.
messages (array, required): Array of message objects, each with role and content fields.
max_tokens (integer, default 16): Maximum number of tokens to generate.
temperature (float, default 1.0): Sampling temperature; higher values produce more random output.
top_k (integer, default -1): Top-k sampling parameter; -1 disables top-k filtering.
top_p (float, default 1.0): Nucleus (top-p) sampling parameter.
stream (boolean, default false): Enable streaming responses.

Response Format

Streaming responses return Server-Sent Events (SSE) with the following format:
data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":"The"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":" capital"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: [DONE]
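A client consuming this stream only needs to strip the data: prefix, stop at the [DONE] sentinel, and collect the content fields from each delta. A minimal line parser, assuming the chunk shape shown above (helper names are illustrative):

```python
import json


def delta_content(sse_line: str):
    """Return the text carried by one SSE line, or None for non-content lines."""
    line = sse_line.strip()
    if not line.startswith("data: "):
        return None  # blank separator line between events
    data = line[len("data: "):]
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(data)
    # The first chunk carries only a role delta and the last an empty delta,
    # so content may be absent.
    return chunk["choices"][0]["delta"].get("content")


def collect_text(sse_lines) -> str:
    """Concatenate all content deltas from an iterable of SSE lines."""
    return "".join(c for line in sse_lines
                   if (c := delta_content(line)) is not None)
```

Feeding the example events above through collect_text yields "The capital".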

OpenAI Client Compatibility

You can use the official OpenAI Python client to interact with Mini-SGLang:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1919/v1",
    api_key="EMPTY"  # API key not required
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    max_tokens=200,
    temperature=0.7,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Server Configuration

Custom Host and Port

python -m minisgl --model "Qwen/Qwen3-0.6B" --host 0.0.0.0 --port 8000

Using ModelScope

If you have network issues reaching Hugging Face, use ModelScope as the model source instead:
python -m minisgl --model "Qwen/Qwen3-32B" --model-source modelscope
For production deployments with multiple GPUs, see the Distributed Serving guide.
