Constrain model outputs using JSON schemas, regex, and EBNF grammars
SGLang enables you to constrain model outputs to follow specific formats using JSON schemas, regular expressions, or EBNF grammars. The model output is guaranteed to follow the specified constraints.
SGLang supports three grammar backends for constrained generation:
XGrammar
Default backend - Best performance and utility. Supports JSON schema, regex, and EBNF.
Outlines
Supports JSON schema and regex constraints.
Llguidance
Supports JSON schema, regex, and EBNF constraints.
To select a backend, use --grammar-backend when launching the server:
```shell
# Use default XGrammar backend
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct

# Use Outlines backend
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --grammar-backend outlines

# Use Llguidance backend
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --grammar-backend llguidance
```
For better output quality, explicitly include instructions in your prompt to guide the model. For example: “Please generate the output in the following JSON format: …”
```python
from pydantic import BaseModel, Field
import openai


class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Please generate the information of the capital of France in JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",
            "schema": CapitalInfo.model_json_schema(),
        },
    },
)

# Validate the response
capital_info = CapitalInfo.model_validate_json(response.choices[0].message.content)
print(capital_info.model_dump_json())
```
Define custom grammars using Extended Backus-Naur Form (EBNF) notation. XGrammar uses the GGML BNF format.
```python
ebnf_grammar = """root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "user",
            "content": "Give me the information of the capital of France.",
        },
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)
print(response.choices[0].message.content)
# Output: "Paris is the capital of France"
```
SGLang’s constrained generation is implemented through the GrammarManager, which:
Compiles grammars - Converts JSON schemas, regex, or EBNF into efficient grammar objects
Caches compiled grammars - Reuses compiled grammars across requests for better performance
Applies constraints during generation - Modifies logits to ensure only valid tokens are sampled
Supports jump-forward optimization - Skips ahead when only one valid continuation exists
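The logit-masking step (point 3 above) can be illustrated with a toy sketch. This is not SGLang's actual implementation; the vocabulary, valid-token set, and function name are invented for illustration:

```python
import math


def apply_grammar_mask(logits, valid_token_ids):
    """Set the logits of tokens the grammar forbids to -inf so they can
    never be sampled; grammar-valid tokens keep their original scores."""
    return [
        score if tok_id in valid_token_ids else -math.inf
        for tok_id, score in enumerate(logits)
    ]


# Toy vocabulary of 5 tokens; suppose the grammar currently allows only tokens 1 and 3.
logits = [2.0, 0.5, 1.7, -0.3, 0.9]
masked = apply_grammar_mask(logits, valid_token_ids={1, 3})
print(masked)  # [-inf, 0.5, -inf, -0.3, -inf]

# Greedy sampling now necessarily picks a grammar-valid token.
best = max(range(len(masked)), key=lambda i: masked[i])
print(best)  # 1
```

When the valid-token set has exactly one member, no sampling is needed at all, which is the situation jump-forward decoding exploits.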
The grammar compilation happens asynchronously to avoid blocking request processing. Requests wait in a grammar queue until their grammar objects are ready.

Source: python/sglang/srt/constrained/grammar_manager.py:24
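The compile-and-cache behavior described above can be sketched as a cache keyed on the grammar specification. This is a toy illustration, not the GrammarManager API; the function name and return value are invented:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def compile_grammar(kind: str, spec: str):
    """Stand-in for grammar compilation: expensive the first time, then
    the compiled object is reused for identical (kind, spec) pairs."""
    # A real backend would build token-level matching structures here.
    return {"kind": kind, "spec": spec}


a = compile_grammar("ebnf", 'root ::= "yes" | "no"')
b = compile_grammar("ebnf", 'root ::= "yes" | "no"')
print(a is b)  # True: the second request reuses the cached grammar object
```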