Mini-SGLang provides an OpenAI-compatible API server that allows you to deploy large language models and integrate them with existing tools and clients.

Launching the API Server

Step 1: Start the server

Launch the API server with a single command. The server will be available on port 1919 by default.
python -m minisgl --model "Qwen/Qwen3-0.6B"
To use a custom port:
python -m minisgl --model "Qwen/Qwen3-0.6B" --port 30000
Step 2: Wait for server initialization

The server will download the model (if not already cached) and compile necessary kernels. You should see output indicating the server is ready:
API server is ready to serve on 0.0.0.0:1919
Step 3: Verify the server is running

Check available models:
curl http://localhost:1919/v1/models
Expected output:
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-0.6B",
      "object": "model",
      "created": 1709510400,
      "owned_by": "mini-sglang",
      "root": "Qwen/Qwen3-0.6B"
    }
  ]
}
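Rather than watching the log by eye, a startup script can poll /v1/models until the server answers. A minimal sketch using only the Python standard library (the base URL, timeout, and helper names are illustrative, not part of Mini-SGLang):

```python
import json
import time
import urllib.request
from urllib.error import URLError


def extract_model_ids(payload: dict) -> list:
    """Pull the model IDs out of a /v1/models response body."""
    return [model["id"] for model in payload.get("data", [])]


def wait_until_ready(base_url: str = "http://localhost:1919",
                     timeout: float = 120.0, interval: float = 2.0) -> list:
    """Poll /v1/models until the server responds; return the served model IDs."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return extract_model_ids(json.load(resp))
        except (URLError, OSError):
            time.sleep(interval)  # server may still be downloading weights or compiling kernels
    raise TimeoutError(f"{base_url} not ready after {timeout}s")
```

With the server from Step 1 running, wait_until_ready() would return a list such as ["Qwen/Qwen3-0.6B"].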

Sending Requests

Chat Completions

The /v1/chat/completions endpoint accepts OpenAI-compatible requests. The example below sets "stream": true, so the response arrives as Server-Sent Events (see Response Format below):
curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": true
  }'
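With "stream": false (or omitted), the endpoint returns a single JSON object instead of an event stream. A standard-library sketch of sending such a request and extracting the reply; it assumes the response follows the usual OpenAI chat-completion shape (choices[0].message.content), which is an assumption here rather than something the output above confirms:

```python
import json
import urllib.request


def extract_reply(payload: dict) -> str:
    """Assumes the OpenAI-style response shape: choices[0].message.content."""
    return payload["choices"][0]["message"]["content"]


def chat_once(base_url: str, model: str, prompt: str, max_tokens: int = 100) -> str:
    """Send a non-streaming chat completion and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```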

Supported Parameters

model (string, required): Model identifier; must match the deployed model.
messages (array, required): Array of message objects, each with role and content fields.
max_tokens (integer, default 16): Maximum number of tokens to generate.
temperature (float, default 1.0): Sampling temperature; higher values produce more random output.
top_k (integer, default -1): Top-k sampling parameter; -1 disables top-k filtering.
top_p (float, default 1.0): Nucleus (top-p) sampling parameter.
stream (boolean, default false): Enable streaming responses.

Response Format

Streaming responses return Server-Sent Events (SSE) with the following format:
data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":"The"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":" capital"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: [DONE]
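A client consuming this stream only needs to strip the data: prefix, stop at the [DONE] sentinel, and collect the content fields from each delta. A minimal line parser, assuming the chunk shape shown above (helper names are illustrative):

```python
import json


def delta_content(sse_line: str):
    """Return the text carried by one SSE line, or None for non-content lines."""
    line = sse_line.strip()
    if not line.startswith("data: "):
        return None  # blank separator line between events
    data = line[len("data: "):]
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(data)
    # The first chunk carries only a role delta and the last an empty delta,
    # so content may be absent.
    return chunk["choices"][0]["delta"].get("content")


def collect_text(sse_lines) -> str:
    """Concatenate all content deltas from an iterable of SSE lines."""
    return "".join(c for line in sse_lines
                   if (c := delta_content(line)) is not None)
```

Feeding the example events above through collect_text yields "The capital".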

OpenAI Client Compatibility

You can use the official OpenAI Python client to interact with Mini-SGLang:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1919/v1",
    api_key="EMPTY"  # API key not required
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    max_tokens=200,
    temperature=0.7,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Server Configuration

Custom Host and Port

python -m minisgl --model "Qwen/Qwen3-0.6B" --host 0.0.0.0 --port 8000

Using ModelScope

If you have network issues reaching Hugging Face, use ModelScope as the model source instead:
python -m minisgl --model "Qwen/Qwen3-32B" --model-source modelscope
For production deployments with multiple GPUs, see the Distributed Serving guide.
