
Models

The models endpoint provides information about available models. This endpoint is compatible with OpenAI’s /v1/models API.

List Models

Retrieve a list of all available models.

Request

curl http://localhost:30000/v1/models

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

models = client.models.list()
for model in models.data:
    print(f"Model ID: {model.id}")
    print(f"Created: {model.created}")
    print(f"Owned by: {model.owned_by}")
    print()

Response

object (string): Always "list".
data (array): Array of model objects. Each model object has the following fields:
  • id (string): Model identifier (e.g., "meta-llama/Llama-3.1-8B-Instruct").
  • object (string): Always "model".
  • created (integer): Unix timestamp when the model was added.
  • owned_by (string): Organization that owns the model (always "sglang").
  • root (string | null): Root model identifier.
  • parent (string | null): Parent model identifier.
  • max_model_len (integer | null): Maximum context length supported by the model.

Example Response

{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.1-8B-Instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "sglang",
      "root": null,
      "parent": null,
      "max_model_len": 131072
    }
  ]
}

Retrieve Model

Get information about a specific model.

Request

curl http://localhost:30000/v1/models/meta-llama/Llama-3.1-8B-Instruct

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

model = client.models.retrieve("meta-llama/Llama-3.1-8B-Instruct")
print(f"Model: {model.id}")
print(f"Max length: {model.max_model_len}")

Response

id (string): Model identifier.
object (string): Always "model".
created (integer): Unix timestamp when the model was added.
owned_by (string): Organization that owns the model.
root (string | null): Root model identifier.
parent (string | null): Parent model identifier.
max_model_len (integer | null): Maximum context length.

Example Response

{
  "id": "meta-llama/Llama-3.1-8B-Instruct",
  "object": "model",
  "created": 1234567890,
  "owned_by": "sglang",
  "root": null,
  "parent": null,
  "max_model_len": 131072
}

LoRA Adapters

When using LoRA adapters, you can reference them using the syntax base-model:adapter-name:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Using a LoRA adapter
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct:my-lora-adapter",
    messages=[{"role": "user", "content": "Hello!"}]
)

Multi-Model Serving

SGLang supports serving multiple models simultaneously using different methods:

Data Parallelism (DP)

Multiple replicas of the same model for higher throughput:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 4

Multiple LoRA Adapters

Serve a base model with multiple LoRA adapters:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --lora-paths adapter1,adapter2,adapter3
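When the adapter list is built dynamically, the launch command can be assembled programmatically before being handed to a process launcher. A minimal sketch, assuming the flags shown above (the adapter paths here are placeholders):

```python
import sys

def build_launch_cmd(model_path: str, lora_paths: list[str],
                     dp_size: int = 1) -> list[str]:
    """Assemble an argv list for python -m sglang.launch_server."""
    cmd = [sys.executable, "-m", "sglang.launch_server",
           "--model-path", model_path]
    if lora_paths:
        # --lora-paths takes a comma-separated list of adapters.
        cmd += ["--lora-paths", ",".join(lora_paths)]
    if dp_size > 1:
        cmd += ["--dp-size", str(dp_size)]
    return cmd

cmd = build_launch_cmd("meta-llama/Llama-3.1-8B-Instruct",
                       ["adapter1", "adapter2", "adapter3"])
print(" ".join(cmd[1:]))
```

The resulting list can be passed directly to subprocess.Popen, avoiding shell-quoting issues.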

Examples

List All Models

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

models = client.models.list()
print(f"Available models: {len(models.data)}")

for model in models.data:
    max_len = model.max_model_len or "Unknown"
    print(f"- {model.id} (max context: {max_len})")

Check Model Capabilities

model = client.models.retrieve("meta-llama/Llama-3.1-8B-Instruct")

# Check if model supports long context
if model.max_model_len and model.max_model_len >= 100000:
    print(f"{model.id} supports long context ({model.max_model_len} tokens)")
else:
    print(f"{model.id} has limited context ({model.max_model_len or 'unknown'} tokens)")

Verify Model Before Request

try:
    model = client.models.retrieve("meta-llama/Llama-3.1-8B-Instruct")
    print(f"Model {model.id} is available")
    
    # Now make a request
    response = client.chat.completions.create(
        model=model.id,
        messages=[{"role": "user", "content": "Hello!"}]
    )
except Exception as e:
    print(f"Model not available: {e}")

Error Handling

Model Not Found

If you request a model that doesn’t exist:
from openai import NotFoundError

try:
    model = client.models.retrieve("nonexistent-model")
except NotFoundError as e:
    print(f"Error: {e}")
    # Error: Model 'nonexistent-model' not found

Supported Models

SGLang supports a wide range of models including:

Language Models

  • Llama: Llama 2, Llama 3, Llama 3.1, Llama 3.2
  • Qwen: Qwen, Qwen2, Qwen2.5
  • Mistral: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
  • DeepSeek: DeepSeek V2, DeepSeek V3
  • Gemma: Gemma 2B, Gemma 7B, Gemma 2

Vision-Language Models

  • Llama 3.2 Vision: 11B, 90B
  • Qwen2-VL: 2B, 7B, 72B
  • InternVL: 2, 2.5
  • LLaVA: 1.5, 1.6, OneVision

Other Models

  • Embedding Models: BGE, E5, etc.
  • Reasoning Models: GPT-OSS models with reasoning support

For a complete list of supported models, see the supported models documentation.

See Also