Constrained decoding guarantees structured outputs, which makes it especially useful for function/tool calling, where responses must follow an exact format.

Overview

ONNX Runtime GenAI integrates LLGuidance for constrained decoding, enabling you to control the format and structure of model outputs.

Constraint Types

There are three types of constrained decoding available:

Lark Grammar

Recommended - Allows both regular output and function/tool output in JSON format

JSON Schema

Output conforms to a provided JSON schema, for example a call to one of your functions/tools

Regex

Match a specific regular expression pattern

Configuration

Tokenizer Modification

To ensure function/tool calling works correctly with constrained decoding, you need to modify your tokenizer.json file. For each model that has its own tool calling token, set the token’s special attribute to true.
Example: Phi-4 mini uses <|tool_call|> and <|/tool_call|> tokens, so you should set the special attribute for them as true inside tokenizer.json.
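As a sketch, the flag can be flipped programmatically. The `added_tokens` layout below follows the Hugging Face tokenizer.json format, and the token ids are stand-in values, not Phi-4 mini's real ids:

```python
import json

# Hypothetical helper: mark tool-calling tokens as special in a
# Hugging Face-style tokenizer.json structure (its "added_tokens" list).
def mark_tool_tokens_special(tokenizer_json, tool_tokens):
    for entry in tokenizer_json.get("added_tokens", []):
        if entry["content"] in tool_tokens:
            entry["special"] = True
    return tokenizer_json

# Minimal stand-in for a real tokenizer.json (ids are illustrative)
tokenizer_json = {
    "added_tokens": [
        {"id": 200025, "content": "<|tool_call|>", "special": False},
        {"id": 200026, "content": "<|/tool_call|>", "special": False},
    ]
}

tokenizer_json = mark_tool_tokens_special(
    tokenizer_json, {"<|tool_call|>", "<|/tool_call|>"}
)
```

In practice you would load the real file with `json.load`, apply the change, and write it back before loading the model.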
Using Lark Grammar

Lark grammar provides the most flexibility, allowing both regular and structured outputs.

Example: JSON Schema with Lark Grammar

import json

import onnxruntime_genai as og
from datasets import load_dataset

# Load a JSON schema from a dataset
dataset = load_dataset(path="epfl-dlab/JSONSchemaBench", name="Github_hard", split="test")
schema = json.loads(dataset[0]["json_schema"])

# Set up model and tokenizer (point model_path at your model folder)
model_path = "path/to/model"
config = og.Config(model_path)
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Configure search options
search_options = {
    "batch_size": 1,
    "temperature": 0.0,
    "max_length": 2048
}

# Create generator params
params = og.GeneratorParams(model)
params.set_search_options(**search_options)

# Set LLGuidance JSON formatting options on the schema
schema["x-guidance"] = {
    "whitespace_flexible": False,
    "key_separator": ": ",
    "item_separator": ", "
}

guidance_type = "lark_grammar"
guidance_input = f"""start: %json {json.dumps(schema)}\n"""

# Set guidance on the params
params.set_guidance(guidance_type, guidance_input, enable_ff_tokens=False)

# Create generator and run inference
generator = og.Generator(model, params)

# Apply chat template and encode
messages = [
    {"role": "system", "content": "You need to generate a JSON object that matches the schema below."},
    {"role": "user", "content": json.dumps(schema, indent=2)}
]

final_prompt = tokenizer.apply_chat_template(messages=json.dumps(messages), add_generation_prompt=True)
final_input = tokenizer.encode(final_prompt)
generator.append_tokens(final_input)

# Generate tokens
while not generator.is_done():
    generator.generate_next_token()

# Get the result, skipping the prompt tokens that get_sequence also returns
output_tokens = generator.get_sequence(0)
output_text = tokenizer.decode(output_tokens[len(final_input):])

# Verify valid JSON
result = json.loads(output_text)
print(json.dumps(result, indent=2))

Using JSON Schema

JSON schema constraints ensure outputs conform to a specific structure.

import onnxruntime_genai as og
import json

# Define your JSON schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string", "format": "email"}
    },
    "required": ["name", "age"]
}

# Set up model
model = og.Model(model_path)
params = og.GeneratorParams(model)

# Set JSON schema guidance
guidance_type = "json_schema"
guidance_input = json.dumps(schema)

params.set_guidance(guidance_type, guidance_input)

# Generate with constraints
generator = og.Generator(model, params)
# ... run generation

Using Regex Constraints

Regex constraints allow you to match specific patterns.

import onnxruntime_genai as og

# Set up model
model = og.Model(model_path)
params = og.GeneratorParams(model)

# Set regex guidance (e.g., email pattern)
guidance_type = "regex"
guidance_input = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

params.set_guidance(guidance_type, guidance_input)

# Generate with constraints
generator = og.Generator(model, params)
# ... run generation
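Because the regex constrains every emitted token, the decoded text should match the pattern end to end. A quick sanity check with Python's re module (the sample output string below is illustrative):

```python
import re

EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# Illustrative decoded output from a regex-constrained run
output_text = "jane.doe@example.com"

# fullmatch enforces the pattern over the entire string, mirroring
# how the constraint applies to the whole generation
is_valid = re.fullmatch(EMAIL_PATTERN, output_text) is not None
```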

Function/Tool Calling

Constrained decoding is particularly useful for function and tool calling scenarios.

import onnxruntime_genai as og
import json

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

# Create schema that allows either text or tool calls
schema = {
    "oneOf": [
        {"type": "string"},  # Regular text response
        {"$ref": "#/definitions/tool_calls"}  # Tool call
    ],
    "definitions": {
        "tool_calls": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "arguments": {"type": "object"}
                }
            }
        }
    }
}

# Set up guidance
model = og.Model(model_path)
params = og.GeneratorParams(model)

guidance_type = "lark_grammar"
guidance_input = f"""start: %json {json.dumps(schema)}\n"""

params.set_guidance(guidance_type, guidance_input)
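Once generation completes, the decoded text parses as either a plain string or a list of tool calls under the schema above. A minimal dispatch sketch; the decoded sample and the get_weather stub are assumptions, not part of the library:

```python
import json

# Stub standing in for a real weather lookup (hypothetical helper)
def get_weather(location, unit="celsius"):
    return {"location": location, "unit": unit, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

# Illustrative decoded output matching the tool_calls branch of the schema
output_text = '[{"name": "get_weather", "arguments": {"location": "Paris"}}]'

result = json.loads(output_text)
if isinstance(result, str):
    print(result)  # plain text response
else:
    for call in result:  # list of tool calls
        handler = TOOLS[call["name"]]
        print(handler(**call["arguments"]))
```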

Best Practices

  • Prefer Lark grammar: it supports both regular text output and structured function/tool calls, giving your application maximum flexibility.
  • Always verify that tool calling tokens are marked as special in your tokenizer.json; this is critical for proper parsing of structured outputs.
  • Test your JSON schemas thoroughly before deploying to ensure they accept all valid outputs and properly constrain invalid ones.
  • Implement error handling for cases where the model cannot satisfy the constraints, and consider fallback strategies.
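One possible shape for that error handling, as a sketch: json.JSONDecodeError covers the common failure where generation is cut short (for example, max_length hit mid-object) and the JSON is truncated.

```python
import json

def parse_constrained_output(output_text, fallback=None):
    """Parse constrained output, returning a fallback when generation
    was cut short and left the JSON truncated."""
    try:
        return json.loads(output_text)
    except json.JSONDecodeError:
        # Fallback strategy: return a default, or re-run generation
        # with a larger max_length or a simpler schema
        return fallback

result = parse_constrained_output('{"name": "Ada"}')
```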

Performance Considerations

Constrained decoding adds computational overhead to token generation. The complexity increases with:
  • More complex grammars
  • Larger JSON schemas
  • More intricate regex patterns
For optimal performance:
  • Keep schemas as simple as possible
  • Use specific constraints rather than overly broad ones
  • Test with your expected load to measure impact
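To measure the impact, per-token latencies can be compared between constrained and unconstrained runs. A simple timing wrapper around the generation loop from the examples above; the DummyGenerator is a stand-in for og.Generator so the sketch runs without a model:

```python
import time

def timed_generation(generator):
    """Run the generation loop, collecting per-token latencies so
    constrained vs. unconstrained runs can be compared."""
    latencies = []
    while not generator.is_done():
        start = time.perf_counter()
        generator.generate_next_token()
        latencies.append(time.perf_counter() - start)
    return latencies

# Dummy generator standing in for og.Generator in this sketch
class DummyGenerator:
    def __init__(self, n):
        self.n = n
    def is_done(self):
        return self.n == 0
    def generate_next_token(self):
        self.n -= 1

lat = timed_generation(DummyGenerator(5))
print(len(lat))  # 5
```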

Next Steps

Runtime Options

Configure additional runtime settings

Model Builder

Prepare models for constrained decoding

API Reference

Explore the Generator API

Examples

View code examples on GitHub
