The ONNX Runtime GenAI Model Builder allows you to quickly create optimized and quantized ONNX models that run with ONNX Runtime GenAI.

Supported Model Architectures

The model builder currently supports the following architectures:
  • AMD OLMo
  • ChatGLM
  • DeepSeek
  • ERNIE 4.5
  • Gemma
  • gpt-oss
  • Granite
  • InternLM2
  • Llama
  • Mistral
  • Nemotron
  • Phi
  • Qwen
  • SmolLM3

Installation

The model builder is included in the ONNX Runtime GenAI Python package:
pip install onnxruntime-genai

Basic Usage

View all available options:
python -m onnxruntime_genai.models.builder --help

Converting Models

PyTorch Model from Hugging Face

Convert a model directly from Hugging Face:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_to_save_hf_files
Parameters:
  • -m: Model name from Hugging Face
  • -o: Output directory for the ONNX model
  • -p: Precision (fp16, fp32, int4, etc.)
  • -e: Execution provider (cpu, cuda, dml, etc.)
  • -c: Cache directory for Hugging Face files

PyTorch Model from Disk

Convert a locally downloaded PyTorch model:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_where_hf_files_are_saved

Customized or Finetuned Model

Convert your custom or finetuned PyTorch model:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_to_store_temp_files

GGUF Model

Convert a GGUF model to ONNX format:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -i path_to_gguf_file \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_for_hf_files

Quantization Options

INT4 Quantization

Convert a pre-quantized INT4 model (AutoGPTQ or AutoAWQ):
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p int4 \
  -e execution_provider \
  -c cache_dir_to_store_temp_files

Shared Embeddings

Enable weight sharing between embedding layer and language modeling head to reduce model size:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p int4 \
  -e cuda \
  --extra_options shared_embeddings=true int4_algo_config=k_quant
Shared embeddings are automatically enabled when tie_word_embeddings=true in the model's config.json. This option cannot be combined with exclude_embeds=true or exclude_lm_head=true.
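Before passing the flag explicitly, you can check whether a model already ties its embeddings. A minimal sketch, assuming the model directory contains a Hugging Face-style config.json:

```python
import json
from pathlib import Path

def ties_word_embeddings(model_dir: str) -> bool:
    """Return True if the model's config.json ties the embedding
    layer to the language modeling head."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    # Many configs omit the key entirely; treat that as not tied.
    return bool(config.get("tie_word_embeddings", False))
```

If this returns True, the builder enables shared embeddings on its own and no extra option is needed.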

QDQ Pattern Quantization

Use the QDQ (Quantize-Dequantize) pattern for 4-bit quantization:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p int4 \
  -e execution_provider \
  --extra_options use_qdq=true

Advanced Options

Config Only Mode

Generate only the configuration files for an existing ONNX model:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_for_hf_files \
  --extra_options config_only=true
After running this, modify the genai_config.json file in the output folder as needed.
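Edits to genai_config.json can be scripted. A minimal sketch, assuming the builder has already written the file and that it contains a "search" section (the specific field names below are illustrative; check the generated file for the options your model exposes):

```python
import json
from pathlib import Path

def set_search_option(output_dir: str, key: str, value) -> None:
    """Update one entry in the 'search' section of genai_config.json.
    Assumes the builder has already written the file to output_dir."""
    path = Path(output_dir, "genai_config.json")
    config = json.loads(path.read_text())
    config.setdefault("search", {})[key] = value
    path.write_text(json.dumps(config, indent=4))

# Example (hypothetical values; inspect your own genai_config.json):
# set_search_option("path_to_output_folder", "max_length", 2048)
```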

Exclude Components

Exclude specific model components:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options exclude_embeds=true

Include Hidden States

Include last hidden states as model output:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options include_hidden_states=true
The last hidden states are also known as embeddings.

CUDA Graph Support

Enable CUDA graph optimization:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e cuda \
  --extra_options enable_cuda_graph=true

Disable QKV Fusion

Keep Q/K/V projections separate instead of fusing them:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options disable_qkv_fusion=true

LoRA Adapter Support

Convert models with LoRA adapters using PEFT:
python -m onnxruntime_genai.models.builder \
  -i path_to_local_folder_on_disk \
  -o path_to_output_folder \
  -p fp16 \
  -e execution_provider \
  -c cache_dir_to_store_temp_files \
  --extra_options adapter_path=path_to_adapter_files
  • Base weights should be in path_to_local_folder_on_disk
  • Adapter weights should be in path_to_adapter_files
See the Multi-LoRA guide for runtime usage.

Testing Models

Create a model with reduced layers for testing:

Option 1: Direct Builder Command

python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options num_hidden_layers=4

Option 2: Edit config.json

1. Locate the Model Files

Navigate to where the PyTorch model is saved on disk.

2. Edit config.json

Modify num_hidden_layers in config.json to your desired value (e.g., 4 layers).

3. Run the Builder
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  -c cache_dir_where_hf_files_are_saved
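Step 2 above can be scripted with a small helper. A minimal sketch, assuming the local model directory contains a Hugging Face-style config.json with a num_hidden_layers field:

```python
import json
from pathlib import Path

def shrink_model_config(model_dir: str, num_layers: int = 4) -> None:
    """Reduce num_hidden_layers in a local config.json so the builder
    exports a small model suitable for quick testing."""
    path = Path(model_dir, "config.json")
    config = json.loads(path.read_text())
    config["num_hidden_layers"] = num_layers
    path.write_text(json.dumps(config, indent=2))
```

Remember to restore the original value (or re-download the config) before building the full model.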

Hugging Face Configuration

Custom Authentication

Disable or use a different Hugging Face token:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options hf_token=false

Remote Code Trust

Disable trusting remote code from Hugging Face:
python -m onnxruntime_genai.models.builder \
  -m model_name \
  -o path_to_output_folder \
  -p precision \
  -e execution_provider \
  --extra_options hf_remote=false

Next Steps

Download Models

Learn about other ways to obtain models

Runtime Options

Configure your model at runtime

Multi-LoRA

Use multiple LoRA adapters dynamically

Quickstart

Run your first inference
