Mini-SGLang provides an OpenAI-compatible API server that allows you to deploy large language models and integrate them with existing tools and clients.
## Launching the API Server
### Start the server

Launch the API server with a single command. The server listens on port 1919 by default:

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B"
```

To use a custom port:

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --port 30000
```
### Wait for server initialization

The server downloads the model (if not already cached) and compiles the necessary kernels. You should see output indicating the server is ready:

```
API server is ready to serve on 0.0.0.0:1919
```
### Verify the server is running

Check available models:

```bash
curl http://localhost:1919/v1/models
```
Expected output:

```json
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-0.6B",
      "object": "model",
      "created": 1709510400,
      "owned_by": "mini-sglang",
      "root": "Qwen/Qwen3-0.6B"
    }
  ]
}
```
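The same check can be scripted in Python. The helper below is a small sketch that inspects a captured `/v1/models` response body (the sample payload mirrors the output above); for a live check you would fetch the body from the endpoint first.

```python
import json

def is_model_served(body: str, model_id: str) -> bool:
    """Return True if model_id appears in a /v1/models response body."""
    return any(m["id"] == model_id for m in json.loads(body)["data"])

# Sample response body, as shown above.
body = (
    '{"object": "list", "data": [{"id": "Qwen/Qwen3-0.6B", '
    '"object": "model", "created": 1709510400, '
    '"owned_by": "mini-sglang", "root": "Qwen/Qwen3-0.6B"}]}'
)

print(is_model_served(body, "Qwen/Qwen3-0.6B"))  # True
```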
## Sending Requests
### Chat Completions
The `/v1/chat/completions` endpoint accepts OpenAI-compatible requests:
```bash
curl http://localhost:1919/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": true
  }'
```
### Supported Parameters

| Parameter | Description |
| --- | --- |
| `model` | Model identifier (must match the deployed model) |
| `messages` | Array of message objects with `role` and `content` fields |
| `max_tokens` | Maximum number of tokens to generate |
| `temperature` | Sampling temperature (higher values = more random) |
| `top_p` | Nucleus sampling parameter |
| `stream` | Enable streaming responses |
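Taken together, these parameters form the JSON request body. A minimal sketch in Python of assembling that body (the `top_p` value here is illustrative, not a server default):

```python
import json

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.95,   # nucleus sampling: restrict sampling to the top probability mass
    "stream": False,
}

# The body sent to /v1/chat/completions is just this dict serialized as JSON.
body = json.dumps(payload)
```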
Streaming responses return Server-Sent Events (SSE) in the following format:

```
data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":"The"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":" capital"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: [DONE]
```
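If you consume the stream without an SDK, each SSE line must have its `data: ` prefix stripped and the remainder parsed as JSON until the `[DONE]` sentinel. A minimal sketch of that loop, assuming the chunk shape shown above:

```python
import json

def sse_to_text(lines):
    """Concatenate delta content from 'data: ...' SSE lines until [DONE]."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:  # role-only and final chunks carry no content
            out.append(delta["content"])
    return "".join(out)

stream = [
    'data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}',
    'data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":"The"},"index":0,"finish_reason":null}]}',
    'data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{"content":" capital"},"index":0,"finish_reason":null}]}',
    'data: {"id":"cmpl-0","object":"text_completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}',
    'data: [DONE]',
]
print(sse_to_text(stream))  # The capital
```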
## OpenAI Client Compatibility
You can use the official OpenAI Python client to interact with Mini-SGLang:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1919/v1",
    api_key="EMPTY",  # API key not required
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    max_tokens=200,
    temperature=0.7,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
## Server Configuration
### Custom Host and Port

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --host 0.0.0.0 --port 8000
```
### Using ModelScope

If you have network issues reaching Hugging Face, use ModelScope as the model source:

```bash
python -m minisgl --model "Qwen/Qwen3-32B" --model-source modelscope
```