SGLang enables you to constrain model outputs to follow specific formats using JSON schemas, regular expressions, or EBNF grammars. The model output is guaranteed to follow the specified constraints.

Grammar Backends

SGLang supports three grammar backends for constrained generation:

XGrammar

The default backend, offering the best performance and broadest feature coverage. Supports JSON schema, regex, and EBNF.

Outlines

Supports JSON schema and regex constraints.

Llguidance

Supports JSON schema, regex, and EBNF constraints.
To select a backend, use --grammar-backend when launching the server:
# Use default XGrammar backend
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct

# Use Outlines backend
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --grammar-backend outlines

# Use Llguidance backend
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --grammar-backend llguidance
For better output quality, explicitly include instructions in your prompt to guide the model. For example: “Please generate the output in the following JSON format: …”

JSON Schema Constraints

Constrain outputs to valid JSON following a specific schema. This is useful for extracting structured data from model responses.

Using Pydantic Models

from pydantic import BaseModel, Field
import openai

class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Please generate the information of the capital of France in JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",
            "schema": CapitalInfo.model_json_schema(),
        },
    },
)

# Validate the response
capital_info = CapitalInfo.model_validate_json(response.choices[0].message.content)
print(capital_info.model_dump_json())

Using Direct JSON Schema

import json

json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "capital_info", "schema": json.loads(json_schema)},
    },
)

print(response.choices[0].message.content)

EBNF Grammars

Define custom grammars using Extended Backus-Naur Form (EBNF) notation. XGrammar uses the GGML BNF format.
ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "user",
            "content": "Give me the information of the capital of France.",
        },
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)

print(response.choices[0].message.content)
# Output: "Paris is the capital of France"

Regular Expression Constraints

Constrain outputs to match a specific regex pattern.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=128,
    extra_body={"regex": "(Paris|London)"},
)

print(response.choices[0].message.content)
# Output: "Paris"

Structural Tags

Combine multiple schemas with trigger patterns for complex structured outputs, such as function calling.
tool_weather = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "state": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "state", "unit"],
}

# Example conversation that may trigger the weather tool
messages = [
    {"role": "user", "content": "What is the weather like in Boston, Massachusetts?"},
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=messages,
    response_format={
        "type": "structural_tag",
        "format": {
            "type": "triggered_tags",
            "triggers": ["<function="],
            "tags": [
                {
                    "begin": "<function=get_current_weather>",
                    "content": {
                        "type": "json_schema",
                        "json_schema": tool_weather,
                    },
                    "end": "</function>",
                },
            ],
        },
    },
)

Native API Usage

You can also use structured outputs with the native SGLang API:
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "What is the capital of France?",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 128,
            "regex": "(Paris|London)",
        },
    },
)

print(response.json()["text"])

Implementation Details

SGLang’s constrained generation is implemented through the GrammarManager, which:
  1. Compiles grammars - Converts JSON schemas, regex, or EBNF into efficient grammar objects
  2. Caches compiled grammars - Reuses compiled grammars across requests for better performance
  3. Applies constraints during generation - Modifies logits to ensure only valid tokens are sampled
  4. Supports jump-forward optimization - Skips ahead when only one valid continuation exists
The grammar compilation happens asynchronously to avoid blocking request processing. Requests wait in a grammar queue until their grammar objects are ready. Source: python/sglang/srt/constrained/grammar_manager.py:24
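Step 3 above, logit masking, can be illustrated with a toy sketch. This is not SGLang's actual code — the real backends compute the allowed-token set from a compiled grammar state and apply the mask on GPU tensors — but the core idea is the same: set the logits of every token the grammar forbids to negative infinity, so sampling can only pick a valid continuation.

```python
import math

def mask_logits(logits, allowed_token_ids):
    """Return a copy of `logits` with disallowed tokens set to -inf."""
    allowed = set(allowed_token_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

# Toy 4-token vocabulary; the grammar only allows tokens 0 and 3 next.
logits = [1.2, 0.3, -0.5, 2.0]
masked = mask_logits(logits, allowed_token_ids={0, 3})
print(masked)  # [1.2, -inf, -inf, 2.0]
```

After masking, softmax assigns zero probability to the forbidden tokens, which is what guarantees the output follows the constraint.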

Performance Considerations

The first request with a new schema incurs compilation overhead. Subsequent requests with the same schema benefit from caching.
Applying grammar constraints adds per-token overhead. The impact varies by grammar complexity.
When the grammar has only one valid continuation, SGLang can skip token-by-token generation and jump forward, significantly improving throughput.
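The jump-forward idea can be shown with a toy example over plain strings (the real optimization operates on token-level grammar states): when every string the grammar still accepts shares a common prefix beyond what has been generated, that prefix can be emitted in one step instead of token by token.

```python
import os.path

def jump_forward(candidates, generated):
    """Extend `generated` by the longest prefix shared by all viable candidates."""
    live = [c for c in candidates if c.startswith(generated)]
    if not live:
        return generated
    # commonprefix compares character-wise, on arbitrary strings
    return os.path.commonprefix(live)

print(jump_forward(["Paris", "Paradise"], "Pa"))  # "Par" -- ambiguous past "Par"
print(jump_forward(["Paris"], "P"))               # "Paris" -- only one continuation
```

With a constraint like the regex `(Paris|London)` above, once the model has produced `"Par"` the remainder `"is"` is forced, so no model forward passes are needed for it.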

Configuration Options

Parameter              Description                                                Default
--grammar-backend      Grammar backend to use: xgrammar, outlines, or llguidance  xgrammar
--skip-tokenizer-init  Skip tokenizer initialization (disables grammar support)   False
For XGrammar technical details and performance characteristics, see the XGrammar technical overview.