
Overview

SGLang provides OpenAI-compatible API endpoints, making it easy to switch from OpenAI to self-hosted models without changing your code.

Base URL

All API endpoints are available at:
http://localhost:30000
Change the host and port using --host and --port flags when launching the server.

Authentication

Optionally enable API key authentication:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key
Include the API key in requests:
curl http://localhost:30000/v1/chat/completions \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

Chat Completions

Endpoint

POST /v1/chat/completions

Basic Example

import openai

client = openai.Client(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # or your API key if authentication is enabled
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.8,
    max_tokens=128
)

print(response.choices[0].message.content)

Streaming Example

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    temperature=0.8
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Request Parameters

model (string, required)
Model identifier. Use the model path or the served model name.

messages (array, required)
Array of message objects with role and content fields. Roles: system, user, assistant, tool.

temperature (float, default: 1.0)
Sampling temperature between 0 and 2. Higher values make output more random.

max_tokens (int, default: 16)
Maximum number of tokens to generate.

top_p (float, default: 1.0)
Nucleus sampling threshold. Only tokens with cumulative probability up to top_p are considered.

top_k (int, default: -1)
Top-k sampling. Only the top_k most likely tokens are considered. Set to -1 to disable.

frequency_penalty (float, default: 0.0)
Penalizes tokens proportionally to how often they have already appeared. Range: -2.0 to 2.0.

presence_penalty (float, default: 0.0)
Penalizes tokens that have appeared at all, regardless of count. Range: -2.0 to 2.0.

n (int, default: 1)
Number of completions to generate for each prompt.

stop (string | array, default: null)
Stop sequences. Generation stops when any of these strings is encountered.

stream (bool, default: false)
Enable streaming responses via Server-Sent Events.

logprobs (bool, default: false)
Return log probabilities of output tokens.

top_logprobs (int, default: 0)
Number of top log probabilities to return for each token.
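
To build intuition for how top_p trims the token distribution, here is a minimal, illustrative sketch of nucleus filtering. This is not SGLang's actual implementation, just the idea behind the parameter:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)
    kept, cum = [], 0.0
    for idx, p in ranked:
        kept.append(idx)
        cum += p
        if cum >= top_p - 1e-9:  # small tolerance for float rounding
            break
    return kept

# With top_p=0.9, the low-probability tail is dropped:
top_p_filter([0.6, 0.3, 0.05, 0.05], 0.9)  # -> [0, 1]
```

With top_p=1.0 every token stays eligible, which is why 1.0 (the default) effectively disables nucleus sampling.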

Response Format

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1699000000,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}

SGLang-Specific Extensions

JSON Schema Constraints

Generate structured JSON output:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Generate user info"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "email": {"type": "string"}
                },
                "required": ["name", "age"]
            }
        }
    }
)
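
Because the output is constrained to the schema, it can be parsed directly with json.loads. A sketch using a sample string in place of response.choices[0].message.content:

```python
import json

# Sample content standing in for response.choices[0].message.content
content = '{"name": "Ada Lovelace", "age": 36, "email": "ada@example.com"}'
user = json.loads(content)  # the schema guarantees the required fields exist
print(user["name"], user["age"])
```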

Regex Constraints

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Generate a phone number"}],
    extra_body={"regex": r"\d{3}-\d{3}-\d{4}"}
)
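
The same pattern can be reused client-side to sanity-check the result with Python's re module (matches_constraint is our helper name, not part of the API):

```python
import re

PHONE_RE = r"\d{3}-\d{3}-\d{4}"  # same pattern sent in extra_body

def matches_constraint(text):
    """True if a completion fully matches the regex constraint."""
    return re.fullmatch(PHONE_RE, text) is not None
```

For example, matches_constraint(response.choices[0].message.content).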

Cache Reporting

Enable cache hit reporting (requires --enable-cache-report flag):
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"return_cached_tokens_details": True}
)

# Check cache statistics
if response.usage.prompt_tokens_details:
    print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")

Text Completions

Endpoint

POST /v1/completions

Example

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.8
)

print(response.choices[0].text)

Request Parameters

model (string, required)
Model identifier.

prompt (string | array, required)
Text prompt(s) or token IDs to generate completions for.

max_tokens (int, default: 16)
Maximum number of tokens to generate.

temperature (float, default: 1.0)
Sampling temperature.

top_p (float, default: 1.0)
Nucleus sampling parameter.

n (int, default: 1)
Number of completions to generate.

echo (bool, default: false)
Echo the prompt in addition to the completion.

stream (bool, default: false)
Enable streaming responses.

Embeddings

Endpoint

POST /v1/embeddings

Example

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello, world!", "SGLang is fast"]
)

for embedding in response.data:
    print(f"Embedding {embedding.index}: {len(embedding.embedding)} dimensions")
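
A common next step is comparing embeddings by cosine similarity. A small self-contained helper (cosine_similarity is our name, not part of the API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# e.g. cosine_similarity(response.data[0].embedding, response.data[1].embedding)
```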

Request Parameters

model (string, required)
Embedding model identifier.

input (string | array, required)
Text or array of texts to generate embeddings for.

dimensions (int, default: null)
Output embedding dimensions (if the model supports dimension reduction).

Model Information

List Models

GET /v1/models
models = client.models.list()
for model in models.data:
    print(model.id)

Get Model Details

GET /v1/models/{model_id}
model = client.models.retrieve("meta-llama/Llama-3.1-8B-Instruct")
print(f"Max context: {model.max_model_len}")

Health and Status

Health Check

GET /health
curl http://localhost:30000/health
Returns 200 OK if the server is healthy.
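
For readiness checks in scripts, the endpoint can be polled from Python's standard library. server_healthy is an illustrative helper, not part of SGLang:

```python
import urllib.request
import urllib.error

def server_healthy(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

For example, server_healthy("http://localhost:30000").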

Server Information

GET /get_model_info
curl http://localhost:30000/get_model_info
Returns detailed server and model configuration.

Error Handling

API errors return standard HTTP status codes:
  • 400 Bad Request - Invalid request parameters
  • 401 Unauthorized - Missing or invalid API key
  • 404 Not Found - Model or endpoint not found
  • 500 Internal Server Error - Server error
  • 503 Service Unavailable - Server is overloaded
Error response format:
{
  "object": "error",
  "message": "Invalid request: temperature must be non-negative",
  "type": "invalid_request_error",
  "code": 400
}
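
When logging failures, the error body above can be summarized in one line. format_api_error is a hypothetical helper written for the format shown:

```python
import json

def format_api_error(body):
    """Summarize an SGLang error response body for logging."""
    err = json.loads(body)
    return f"[{err['code']}] {err['type']}: {err['message']}"

format_api_error(
    '{"object": "error", '
    '"message": "Invalid request: temperature must be non-negative", '
    '"type": "invalid_request_error", "code": 400}'
)
# -> "[400] invalid_request_error: Invalid request: temperature must be non-negative"
```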

Rate Limiting

Configure request limits:
max-running-requests (int, default: null)
Maximum number of requests being processed concurrently.

max-queued-requests (int, default: null)
Maximum number of requests allowed in the queue.
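
Assuming the same sglang serve launcher shown under Authentication, these settings map to CLI flags at startup. A sketch; verify the flag names against your SGLang version:

```shell
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 64 \
  --max-queued-requests 256
```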

LoRA Adapters

SGLang supports dynamic LoRA adapter selection per request:
response = client.chat.completions.create(
    model="base-model:adapter-name",  # Specify adapter with colon syntax
    messages=[{"role": "user", "content": "Hello"}]
)

# Or use lora_path parameter
response = client.chat.completions.create(
    model="base-model",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"lora_path": "/path/to/adapter"}
)

See Also