
Overview

SGLang provides OpenAI-compatible API endpoints, making it easy to switch from OpenAI to self-hosted models without changing your code.

Base URL

All API endpoints are available at:
http://localhost:30000
Change the host and port using --host and --port flags when launching the server.

Authentication

Optionally enable API key authentication:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key
Include the API key in requests:
curl http://localhost:30000/v1/chat/completions \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

Chat Completions

Endpoint

POST /v1/chat/completions

Basic Example

import openai

client = openai.Client(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # or your API key if authentication is enabled
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.8,
    max_tokens=128
)

print(response.choices[0].message.content)

Streaming Example

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    temperature=0.8
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Request Parameters

model (string, required)
Model identifier. Use the model path or the served model name.

messages (array, required)
Array of message objects with role and content fields. Roles: system, user, assistant, tool.

temperature (float, default: 1.0)
Sampling temperature between 0 and 2. Higher values make output more random.

max_tokens (int, default: 16)
Maximum number of tokens to generate.

top_p (float, default: 1.0)
Nucleus sampling threshold. Only tokens with cumulative probability up to top_p are considered.

top_k (int, default: -1)
Top-k sampling. Only the top_k most likely tokens are considered. Set to -1 to disable.

frequency_penalty (float, default: 0.0)
Penalizes tokens proportionally to how often they have already appeared. Range: -2.0 to 2.0.

presence_penalty (float, default: 0.0)
Penalizes tokens that have appeared at all, regardless of count. Range: -2.0 to 2.0.

n (int, default: 1)
Number of completions to generate for each prompt.

stop (string | array, default: null)
Stop sequences. Generation stops when any of these strings is encountered.

stream (bool, default: false)
Enable streaming responses via Server-Sent Events.

logprobs (bool, default: false)
Return log probabilities of output tokens.

top_logprobs (int, default: 0)
Number of top log probabilities to return for each token.
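
To build intuition for how top_p trims the token distribution, here is a minimal, illustrative sketch of nucleus filtering. This is not SGLang's actual implementation, just the idea behind the parameter:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)
    kept, cum = [], 0.0
    for idx, p in ranked:
        kept.append(idx)
        cum += p
        if cum >= top_p - 1e-9:  # small tolerance for float rounding
            break
    return kept

# With top_p=0.9, the low-probability tail is dropped:
top_p_filter([0.6, 0.3, 0.05, 0.05], 0.9)  # -> [0, 1]
```

With top_p=1.0 every token stays eligible, which is why 1.0 (the default) effectively disables nucleus sampling.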

Response Format

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1699000000,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}

SGLang-Specific Extensions

JSON Schema Constraints

Generate structured JSON output:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Generate user info"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "email": {"type": "string"}
                },
                "required": ["name", "age"]
            }
        }
    }
)
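
Because the output is constrained to the schema, it can be parsed directly with json.loads. A sketch using a sample string in place of response.choices[0].message.content:

```python
import json

# Sample content standing in for response.choices[0].message.content
content = '{"name": "Ada Lovelace", "age": 36, "email": "ada@example.com"}'
user = json.loads(content)  # the schema guarantees the required fields exist
print(user["name"], user["age"])
```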

Regex Constraints

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Generate a phone number"}],
    extra_body={"regex": r"\d{3}-\d{3}-\d{4}"}
)
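
The same pattern can be reused client-side to sanity-check the result with Python's re module (matches_constraint is our helper name, not part of the API):

```python
import re

PHONE_RE = r"\d{3}-\d{3}-\d{4}"  # same pattern sent in extra_body

def matches_constraint(text):
    """True if a completion fully matches the regex constraint."""
    return re.fullmatch(PHONE_RE, text) is not None
```

For example, matches_constraint(response.choices[0].message.content).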

Cache Reporting

Enable cache hit reporting (requires --enable-cache-report flag):
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"return_cached_tokens_details": True}
)

# Check cache statistics
if response.usage.prompt_tokens_details:
    print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")

Text Completions

Endpoint

POST /v1/completions

Example

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.8
)

print(response.choices[0].text)

Request Parameters

model (string, required)
Model identifier.

prompt (string | array, required)
Text prompt(s) or token IDs to generate completions for.

max_tokens (int, default: 16)
Maximum number of tokens to generate.

temperature (float, default: 1.0)
Sampling temperature.

top_p (float, default: 1.0)
Nucleus sampling parameter.

n (int, default: 1)
Number of completions to generate.

echo (bool, default: false)
Echo the prompt in addition to the completion.

stream (bool, default: false)
Enable streaming responses.

Embeddings

Endpoint

POST /v1/embeddings

Example

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello, world!", "SGLang is fast"]
)

for embedding in response.data:
    print(f"Embedding {embedding.index}: {len(embedding.embedding)} dimensions")
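
A common next step is comparing embeddings by cosine similarity. A small self-contained helper (cosine_similarity is our name, not part of the API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# e.g. cosine_similarity(response.data[0].embedding, response.data[1].embedding)
```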

Request Parameters

model (string, required)
Embedding model identifier.

input (string | array, required)
Text or array of texts to generate embeddings for.

dimensions (int, default: null)
Output embedding dimensions (if the model supports dimension reduction).

Model Information

List Models

GET /v1/models
models = client.models.list()
for model in models.data:
    print(model.id)

Get Model Details

GET /v1/models/{model_id}
model = client.models.retrieve("meta-llama/Llama-3.1-8B-Instruct")
print(f"Max context: {model.max_model_len}")

Health and Status

Health Check

GET /health
curl http://localhost:30000/health
Returns 200 OK if the server is healthy.
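
For readiness checks in scripts, the endpoint can be polled from Python's standard library. server_healthy is an illustrative helper, not part of SGLang:

```python
import urllib.request
import urllib.error

def server_healthy(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

For example, server_healthy("http://localhost:30000").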

Server Information

GET /get_model_info
curl http://localhost:30000/get_model_info
Returns detailed server and model configuration.

Error Handling

API errors return standard HTTP status codes:
  • 400 Bad Request - Invalid request parameters
  • 401 Unauthorized - Missing or invalid API key
  • 404 Not Found - Model or endpoint not found
  • 500 Internal Server Error - Server error
  • 503 Service Unavailable - Server is overloaded
Error response format:
{
  "object": "error",
  "message": "Invalid request: temperature must be non-negative",
  "type": "invalid_request_error",
  "code": 400
}
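
When logging failures, the error body above can be summarized in one line. format_api_error is a hypothetical helper written for the format shown:

```python
import json

def format_api_error(body):
    """Summarize an SGLang error response body for logging."""
    err = json.loads(body)
    return f"[{err['code']}] {err['type']}: {err['message']}"

format_api_error(
    '{"object": "error", '
    '"message": "Invalid request: temperature must be non-negative", '
    '"type": "invalid_request_error", "code": 400}'
)
# -> "[400] invalid_request_error: Invalid request: temperature must be non-negative"
```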

Rate Limiting

Configure request limits:
max-running-requests (int, default: null)
Maximum number of requests being processed concurrently.

max-queued-requests (int, default: null)
Maximum number of requests allowed in the queue.
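
Assuming the same sglang serve launcher shown under Authentication, these settings map to CLI flags at startup. A sketch; verify the flag names against your SGLang version:

```shell
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 64 \
  --max-queued-requests 256
```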

LoRA Adapters

SGLang supports dynamic LoRA adapter selection per request:
response = client.chat.completions.create(
    model="base-model:adapter-name",  # Specify adapter with colon syntax
    messages=[{"role": "user", "content": "Hello"}]
)

# Or use lora_path parameter
response = client.chat.completions.create(
    model="base-model",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"lora_path": "/path/to/adapter"}
)

See Also