Constrained decoding guarantees structured outputs, which makes it especially useful for function/tool calling, where responses must follow an exact format.

Overview

ONNX Runtime GenAI integrates LLGuidance for constrained decoding, enabling you to control the format and structure of model outputs.

Constraint Types

There are three types of constrained decoding available:

Lark Grammar

Recommended - Allows both regular output and function/tool output in JSON format

JSON Schema

Output conforms to a provided JSON schema, for example a call to one of your functions/tools

Regex

Match a specific regular expression pattern

Configuration

Tokenizer Modification

To ensure function/tool calling works correctly with constrained decoding, you need to modify your tokenizer.json file. For each model that has its own tool calling token, set the token’s special attribute to true.
Example: Phi-4 mini uses <|tool_call|> and <|/tool_call|> tokens, so you should set the special attribute for them as true inside tokenizer.json.
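As a sketch, the flag can be flipped programmatically. The `added_tokens` layout below follows the Hugging Face tokenizer.json format, and the token ids are stand-in values, not Phi-4 mini's real ids:

```python
import json

# Hypothetical helper: mark tool-calling tokens as special in a
# Hugging Face-style tokenizer.json structure (its "added_tokens" list).
def mark_tool_tokens_special(tokenizer_json, tool_tokens):
    for entry in tokenizer_json.get("added_tokens", []):
        if entry["content"] in tool_tokens:
            entry["special"] = True
    return tokenizer_json

# Minimal stand-in for a real tokenizer.json (ids are illustrative)
tokenizer_json = {
    "added_tokens": [
        {"id": 200025, "content": "<|tool_call|>", "special": False},
        {"id": 200026, "content": "<|/tool_call|>", "special": False},
    ]
}

tokenizer_json = mark_tool_tokens_special(
    tokenizer_json, {"<|tool_call|>", "<|/tool_call|>"}
)
```

In practice you would load the real file with `json.load`, apply the change, and write it back before loading the model.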
Using Lark Grammar

Lark grammar provides the most flexibility, allowing both regular and structured outputs.

Example: JSON Schema with Lark Grammar

import json

import onnxruntime_genai as og
from datasets import load_dataset

# Load a JSON schema from a dataset
dataset = load_dataset(path="epfl-dlab/JSONSchemaBench", name="Github_hard", split="test")
schema = json.loads(dataset[0]["json_schema"])

# Set up model and tokenizer (point model_path at your model folder)
model_path = "path/to/model"
config = og.Config(model_path)
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Configure search options
search_options = {
    "batch_size": 1,
    "temperature": 0.0,
    "max_length": 2048
}

# Create generator params
params = og.GeneratorParams(model)
params.set_search_options(**search_options)

# Set LLGuidance JSON formatting options on the schema
schema["x-guidance"] = {
    "whitespace_flexible": False,
    "key_separator": ": ",
    "item_separator": ", "
}

guidance_type = "lark_grammar"
guidance_input = f"""start: %json {json.dumps(schema)}\n"""

# Set guidance on the params
params.set_guidance(guidance_type, guidance_input, enable_ff_tokens=False)

# Create generator and run inference
generator = og.Generator(model, params)

# Apply chat template and encode
messages = [
    {"role": "system", "content": "You need to generate a JSON object that matches the schema below."},
    {"role": "user", "content": json.dumps(schema, indent=2)}
]

final_prompt = tokenizer.apply_chat_template(messages=json.dumps(messages), add_generation_prompt=True)
final_input = tokenizer.encode(final_prompt)
generator.append_tokens(final_input)

# Generate tokens
while not generator.is_done():
    generator.generate_next_token()

# Get the result, skipping the prompt tokens that get_sequence also returns
output_tokens = generator.get_sequence(0)
output_text = tokenizer.decode(output_tokens[len(final_input):])

# Verify valid JSON
result = json.loads(output_text)
print(json.dumps(result, indent=2))

Using JSON Schema

JSON schema constraints ensure outputs conform to a specific structure.

import onnxruntime_genai as og
import json

# Define your JSON schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string", "format": "email"}
    },
    "required": ["name", "age"]
}

# Set up model
model = og.Model(model_path)
params = og.GeneratorParams(model)

# Set JSON schema guidance
guidance_type = "json_schema"
guidance_input = json.dumps(schema)

params.set_guidance(guidance_type, guidance_input)

# Generate with constraints
generator = og.Generator(model, params)
# ... run generation

Using Regex Constraints

Regex constraints allow you to match specific patterns.

import onnxruntime_genai as og

# Set up model
model = og.Model(model_path)
params = og.GeneratorParams(model)

# Set regex guidance (e.g., email pattern)
guidance_type = "regex"
guidance_input = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

params.set_guidance(guidance_type, guidance_input)

# Generate with constraints
generator = og.Generator(model, params)
# ... run generation
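Because the regex constrains every emitted token, the decoded text should match the pattern end to end. A quick sanity check with Python's re module (the sample output string below is illustrative):

```python
import re

EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# Illustrative decoded output from a regex-constrained run
output_text = "jane.doe@example.com"

# fullmatch enforces the pattern over the entire string, mirroring
# how the constraint applies to the whole generation
is_valid = re.fullmatch(EMAIL_PATTERN, output_text) is not None
```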

Function/Tool Calling

Constrained decoding is particularly useful for function and tool calling scenarios.

import onnxruntime_genai as og
import json

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

# Create schema that allows either text or tool calls
schema = {
    "oneOf": [
        {"type": "string"},  # Regular text response
        {"$ref": "#/definitions/tool_calls"}  # Tool call
    ],
    "definitions": {
        "tool_calls": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "arguments": {"type": "object"}
                }
            }
        }
    }
}

# Set up guidance
model = og.Model(model_path)
params = og.GeneratorParams(model)

guidance_type = "lark_grammar"
guidance_input = f"""start: %json {json.dumps(schema)}\n"""

params.set_guidance(guidance_type, guidance_input)
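Once generation completes, the decoded text parses as either a plain string or a list of tool calls under the schema above. A minimal dispatch sketch; the decoded sample and the get_weather stub are assumptions, not part of the library:

```python
import json

# Stub standing in for a real weather lookup (hypothetical helper)
def get_weather(location, unit="celsius"):
    return {"location": location, "unit": unit, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

# Illustrative decoded output matching the tool_calls branch of the schema
output_text = '[{"name": "get_weather", "arguments": {"location": "Paris"}}]'

result = json.loads(output_text)
if isinstance(result, str):
    print(result)  # plain text response
else:
    for call in result:  # list of tool calls
        handler = TOOLS[call["name"]]
        print(handler(**call["arguments"]))
```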

Best Practices

  • Prefer Lark grammar: it supports both regular text output and structured function/tool calls, giving your application maximum flexibility.
  • Always verify that tool calling tokens are marked as special in your tokenizer.json; this is critical for proper parsing of structured outputs.
  • Test your JSON schemas thoroughly before deploying to ensure they accept all valid outputs and properly constrain invalid ones.
  • Implement error handling for cases where the model cannot satisfy the constraints, and consider fallback strategies.
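One possible shape for that error handling, as a sketch: json.JSONDecodeError covers the common failure where generation is cut short (for example, max_length hit mid-object) and the JSON is truncated.

```python
import json

def parse_constrained_output(output_text, fallback=None):
    """Parse constrained output, returning a fallback when generation
    was cut short and left the JSON truncated."""
    try:
        return json.loads(output_text)
    except json.JSONDecodeError:
        # Fallback strategy: return a default, or re-run generation
        # with a larger max_length or a simpler schema
        return fallback

result = parse_constrained_output('{"name": "Ada"}')
```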

Performance Considerations

Constrained decoding adds computational overhead to token generation. The complexity increases with:
  • More complex grammars
  • Larger JSON schemas
  • More intricate regex patterns
For optimal performance:
  • Keep schemas as simple as possible
  • Use specific constraints rather than overly broad ones
  • Test with your expected load to measure impact
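To measure the impact, per-token latencies can be compared between constrained and unconstrained runs. A simple timing wrapper around the generation loop from the examples above; the DummyGenerator is a stand-in for og.Generator so the sketch runs without a model:

```python
import time

def timed_generation(generator):
    """Run the generation loop, collecting per-token latencies so
    constrained vs. unconstrained runs can be compared."""
    latencies = []
    while not generator.is_done():
        start = time.perf_counter()
        generator.generate_next_token()
        latencies.append(time.perf_counter() - start)
    return latencies

# Dummy generator standing in for og.Generator in this sketch
class DummyGenerator:
    def __init__(self, n):
        self.n = n
    def is_done(self):
        return self.n == 0
    def generate_next_token(self):
        self.n -= 1

lat = timed_generation(DummyGenerator(5))
print(len(lat))  # 5
```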

Next Steps

Runtime Options

Configure additional runtime settings

Model Builder

Prepare models for constrained decoding

API Reference

Explore the Generator API

Examples

View code examples on GitHub
