Constrained decoding is especially useful for function/tool calling because it guarantees structured outputs: the model's response is always in the expected format.
## Overview
ONNX Runtime GenAI integrates LLGuidance for constrained decoding, enabling you to control the format and structure of model outputs.
## Constraint Types

There are three types of constrained decoding available:

- **Lark grammar** (recommended): allows both regular output and function/tool output in JSON format
- **JSON schema**: output will match the JSON schema and be one of the provided functions/tools
- **Regex**: output will match a specific regular expression pattern
## Configuration

### Tokenizer Modification
To ensure function/tool calling works correctly with constrained decoding, you need to modify your `tokenizer.json` file: for each model that has its own tool-calling tokens, set each token's `special` attribute to `true`.

For example, Phi-4 mini uses the `<|tool_call|>` and `<|/tool_call|>` tokens, so set the `special` attribute for both to `true` inside `tokenizer.json`.
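The flag can also be flipped programmatically. A minimal sketch, assuming the standard Hugging Face `tokenizer.json` layout where each entry in `added_tokens` carries a `special` field (shown here on an in-memory dict rather than the real file):

```python
import json

def mark_tokens_special(tok_config, token_contents):
    """Set special=True on matching added_tokens entries of a tokenizer.json dict."""
    for entry in tok_config.get("added_tokens", []):
        if entry.get("content") in token_contents:
            entry["special"] = True
    return tok_config

# In practice: load tokenizer.json, patch it, and write it back to disk.
tok_config = {"added_tokens": [
    {"content": "<|tool_call|>", "special": False},
    {"content": "<|/tool_call|>", "special": False},
]}
tok_config = mark_tokens_special(tok_config, {"<|tool_call|>", "<|/tool_call|>"})
print(json.dumps(tok_config, indent=2))
```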
## Using Lark Grammar (Recommended)
Lark grammar provides the most flexibility, allowing both regular and structured outputs.
### Example: JSON Schema with Lark Grammar

```python
import json

import onnxruntime_genai as og
from datasets import load_dataset

# Load a JSON schema from a dataset
dataset = load_dataset(path="epfl-dlab/JSONSchemaBench", name="Github_hard", split="test")
schema = json.loads(dataset[0]["json_schema"])

# Set up model and tokenizer
model_path = "path/to/model"  # directory containing the ONNX model and genai_config.json
config = og.Config(model_path)
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Configure search options
search_options = {
    "batch_size": 1,
    "temperature": 0.0,
    "max_length": 2048,
}

# Create generator params
params = og.GeneratorParams(model)
params.set_search_options(**search_options)

# Configure guidance with Lark grammar
schema["x-guidance"] = {
    "whitespace_flexible": False,
    "key_separator": ": ",
    "item_separator": ", ",
}
guidance_type = "lark_grammar"
guidance_input = f"""start: %json {json.dumps(schema)}\n"""

# Set guidance on the params
params.set_guidance(guidance_type, guidance_input, enable_ff_tokens=False)

# Create generator and run inference
generator = og.Generator(model, params)

# Apply chat template and encode
messages = [
    {"role": "system", "content": "You need to generate a JSON object that matches the schema below."},
    {"role": "user", "content": json.dumps(schema, indent=2)},
]
final_prompt = tokenizer.apply_chat_template(messages=json.dumps(messages), add_generation_prompt=True)
final_input = tokenizer.encode(final_prompt)
generator.append_tokens(final_input)

# Generate tokens
while not generator.is_done():
    generator.generate_next_token()

# Get the result
output_tokens = generator.get_sequence(0)
output_text = tokenizer.decode(output_tokens)

# Verify valid JSON
result = json.loads(output_text)
print(json.dumps(result, indent=2))
```
## Using JSON Schema

JSON schema constraints ensure outputs conform to a specific structure.
```python
import json

import onnxruntime_genai as og

# Define your JSON schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string", "format": "email"},
    },
    "required": ["name", "age"],
}

# Set up model
model_path = "path/to/model"  # directory containing the ONNX model and genai_config.json
model = og.Model(model_path)
params = og.GeneratorParams(model)

# Set JSON schema guidance
guidance_type = "json_schema"
guidance_input = json.dumps(schema)
params.set_guidance(guidance_type, guidance_input)

# Generate with constraints
generator = og.Generator(model, params)
# ... run generation
```
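Once generation finishes, the decoded text should parse as JSON and contain the schema's required keys. A minimal post-processing sketch, using a hypothetical hard-coded `output_text` in place of the real `tokenizer.decode(...)` result:

```python
import json

# Hypothetical decoded model output; in practice this comes from tokenizer.decode(...)
output_text = '{"name": "Ada Lovelace", "age": 36}'

result = json.loads(output_text)  # raises ValueError if the output is not valid JSON
missing = [k for k in ("name", "age") if k not in result]
assert not missing, f"missing required keys: {missing}"
```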
## Using Regex Constraints

Regex constraints allow you to match specific patterns.
```python
import onnxruntime_genai as og

# Set up model
model_path = "path/to/model"  # directory containing the ONNX model and genai_config.json
model = og.Model(model_path)
params = og.GeneratorParams(model)

# Set regex guidance (e.g., email pattern)
guidance_type = "regex"
guidance_input = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
params.set_guidance(guidance_type, guidance_input)

# Generate with constraints
generator = og.Generator(model, params)
# ... run generation
```
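Before wiring a pattern into `set_guidance`, it can help to sanity-check it against known-good and known-bad strings with Python's `re` module:

```python
import re

# Same email pattern as passed to set_guidance above
pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

assert pattern.fullmatch("user@example.com")
assert pattern.fullmatch("first.last+tag@sub.domain.org")
assert not pattern.fullmatch("not-an-email")
```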
## Function and Tool Calling

Constrained decoding is particularly useful for function and tool calling scenarios.
```python
import json

import onnxruntime_genai as og

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

# Create schema that allows either text or tool calls
schema = {
    "oneOf": [
        {"type": "string"},                    # Regular text response
        {"$ref": "#/definitions/tool_calls"},  # Tool call
    ],
    "definitions": {
        "tool_calls": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "arguments": {"type": "object"},
                },
            },
        },
    },
}

# Set up guidance
model_path = "path/to/model"  # directory containing the ONNX model and genai_config.json
model = og.Model(model_path)
params = og.GeneratorParams(model)
guidance_type = "lark_grammar"
guidance_input = f"""start: %json {json.dumps(schema)}\n"""
params.set_guidance(guidance_type, guidance_input)
```
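Because the schema permits either a JSON string or an array of tool calls, the application still has to route the response. A minimal dispatch sketch, with a hypothetical `registry` mapping tool names to stub implementations:

```python
import json

def dispatch(response_text, registry):
    """Route a constrained response: a JSON array is a list of tool calls,
    anything else is treated as plain text."""
    parsed = json.loads(response_text)
    if isinstance(parsed, list):
        return [registry[call["name"]](**call.get("arguments", {})) for call in parsed]
    return parsed

# Stub implementation of get_weather for illustration only
registry = {"get_weather": lambda location, unit="celsius": f"22 degrees {unit} in {location}"}

print(dispatch('[{"name": "get_weather", "arguments": {"location": "Paris"}}]', registry))
print(dispatch('"It is sunny today."', registry))
```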
## Best Practices

### Use Lark Grammar for Flexibility
Lark grammar is recommended because it supports both regular text output and structured function/tool calls. This gives you maximum flexibility in your applications.
### Validate Tokenizer Configuration
Always verify that tool-calling tokens are marked as special in your `tokenizer.json`. This is critical for proper parsing of structured outputs.
### Test Schemas Thoroughly

Before deploying, test your JSON schemas thoroughly to ensure they capture all valid outputs and properly constrain invalid ones.

### Handle Constraint Failures

Implement error handling for cases where the model cannot satisfy the constraints, and consider fallback strategies.
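One simple fallback strategy is to catch parse failures and substitute a default value. A minimal sketch (the function name and default are illustrative, not part of the library):

```python
import json

def parse_or_fallback(output_text, fallback=None):
    """Return parsed JSON, or a fallback value when the output is unusable."""
    try:
        return json.loads(output_text)
    except json.JSONDecodeError:
        return fallback

print(parse_or_fallback('{"name": "Ada"}'))
print(parse_or_fallback("truncated or malformed output", fallback={}))
```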
## Performance Considerations

Constrained decoding adds computational overhead to token generation. The overhead increases with:

- More complex grammars
- Larger JSON schemas
- More intricate regex patterns
For optimal performance:

- Keep schemas as simple as possible
- Use specific constraints rather than overly broad ones
- Test with your expected load to measure impact
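To measure the impact, a small timing harness around your generation loop is usually enough. A sketch using `time.perf_counter` (the callable you pass in wraps your own constrained-decoding loop):

```python
import time

def average_latency(generate_once, runs=5):
    """Average wall-clock time of a generation callable over several runs."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_once()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Usage sketch: compare the same prompt with and without guidance, e.g.
# avg = average_latency(lambda: run_generation(params_with_guidance))
```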
## Next Steps

- **Runtime Options**: configure additional runtime settings
- **Model Builder**: prepare models for constrained decoding
- **API Reference**: explore the Generator API
- **Examples**: view code examples on GitHub