The ONNX Runtime GenAI Model Builder allows you to quickly create optimized and quantized ONNX models that run with ONNX Runtime GenAI.

Supported Model Architectures

The model builder currently supports the following architectures:
  • AMD OLMo
  • ChatGLM
  • DeepSeek
  • ERNIE 4.5
  • Gemma
  • gpt-oss
  • Granite
  • InternLM2
  • Llama
  • Mistral
  • Nemotron
  • Phi
  • Qwen
  • SmolLM3

Installation

The model builder is included in the ONNX Runtime GenAI Python package:
pip install onnxruntime-genai

Basic Usage

View all available options:
python -m onnxruntime_genai.models.builder --help

Converting Models

PyTorch Model from Hugging Face

Convert a model directly from Hugging Face:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_to_save_hf_files
Parameters:
  • -m: Model name from Hugging Face
  • -o: Output directory for the ONNX model
  • -p: Precision (fp16, fp32, int4, etc.)
  • -e: Execution provider (cpu, cuda, dml, etc.)
  • -c: Cache directory for Hugging Face files

PyTorch Model from Disk

Convert a locally downloaded PyTorch model:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_where_hf_files_are_saved

Customized or Finetuned Model

Convert your custom or finetuned PyTorch model:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_to_store_temp_files

GGUF Model

Convert a GGUF model to ONNX format:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -i path_to_gguf_file \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_for_hf_files

Quantization Options

INT4 Quantization

Convert a pre-quantized INT4 model (AutoGPTQ or AutoAWQ):
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p int4 \
  -e execution_provider \
  -c cache_dir_to_store_temp_files

Shared Embeddings

Enable weight sharing between embedding layer and language modeling head to reduce model size:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p int4 \
  -e cuda \
  --extra_options shared_embeddings=true int4_algo_config=k_quant
Shared embeddings are automatically enabled when tie_word_embeddings=true in the model's config.json. This option cannot be combined with exclude_embeds=true or exclude_lm_head=true.
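Before passing the flag explicitly, you can check whether a model already ties its embeddings. A minimal sketch, assuming the model directory contains a Hugging Face-style config.json:

```python
import json
from pathlib import Path

def ties_word_embeddings(model_dir: str) -> bool:
    """Return True if the model's config.json ties the embedding
    layer to the language modeling head."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    # Many configs omit the key entirely; treat that as not tied.
    return bool(config.get("tie_word_embeddings", False))
```

If this returns True, the builder enables shared embeddings on its own and no extra option is needed.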

QDQ Pattern Quantization

Use the QDQ (Quantize-Dequantize) pattern for 4-bit quantization:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p int4 \
  -e execution_provider \
  --extra_options use_qdq=true

Advanced Options

Config Only Mode

Generate only the configuration files for an existing ONNX model:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_for_hf_files \
  --extra_options config_only=true
After running this, modify the genai_config.json file in the output folder as needed.
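Edits to genai_config.json can be scripted. A minimal sketch, assuming the builder has already written the file and that it contains a "search" section (the specific field names below are illustrative; check the generated file for the options your model exposes):

```python
import json
from pathlib import Path

def set_search_option(output_dir: str, key: str, value) -> None:
    """Update one entry in the 'search' section of genai_config.json.
    Assumes the builder has already written the file to output_dir."""
    path = Path(output_dir, "genai_config.json")
    config = json.loads(path.read_text())
    config.setdefault("search", {})[key] = value
    path.write_text(json.dumps(config, indent=4))

# Example (hypothetical values; inspect your own genai_config.json):
# set_search_option("path_to_output_folder", "max_length", 2048)
```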

Exclude Components

Exclude specific model components:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options exclude_embeds=true

Include Hidden States

Include last hidden states as model output:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options include_hidden_states=true
The last hidden states are also known as embeddings.

CUDA Graph Support

Enable CUDA graph optimization:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e cuda \
  --extra_options enable_cuda_graph=true

Disable QKV Fusion

Keep Q/K/V projections separate instead of fusing them:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options disable_qkv_fusion=true

LoRA Adapter Support

Convert models with LoRA adapters using PEFT:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p fp16 \
  -e execution_provider \
  -c cache_dir_to_store_temp_files \
  --extra_options adapter_path=path_to_adapter_files
  • Base weights should be in path_to_local_folder_on_disk
  • Adapter weights should be in path_to_adapter_files
See the Multi-LoRA guide for runtime usage.

Testing Models

Create a model with reduced layers for testing:

Option 1: Direct Builder Command

python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options num_hidden_layers=4

Option 2: Edit config.json

1. Locate the Model Files

Navigate to where the PyTorch model is saved on disk.

2. Edit config.json

Modify num_hidden_layers in config.json to your desired value (e.g., 4 layers).

3. Run the Builder
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_where_hf_files_are_saved
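Step 2 above can be scripted with a small helper. A minimal sketch, assuming the local model directory contains a Hugging Face-style config.json with a num_hidden_layers field:

```python
import json
from pathlib import Path

def shrink_model_config(model_dir: str, num_layers: int = 4) -> None:
    """Reduce num_hidden_layers in a local config.json so the builder
    exports a small model suitable for quick testing."""
    path = Path(model_dir, "config.json")
    config = json.loads(path.read_text())
    config["num_hidden_layers"] = num_layers
    path.write_text(json.dumps(config, indent=2))
```

Remember to restore the original value (or re-download the config) before building the full model.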

Hugging Face Configuration

Custom Authentication

Disable or use a different Hugging Face token:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options hf_token=false

Remote Code Trust

Disable trusting remote code from Hugging Face:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options hf_remote=false

Next Steps

Download Models

Learn about other ways to obtain models

Runtime Options

Configure your model at runtime

Multi-LoRA

Use multiple LoRA adapters dynamically

Quickstart

Run your first inference
