The ONNX Runtime GenAI Model Builder allows you to quickly create optimized and quantized ONNX models that run with ONNX Runtime GenAI.
Supported Model Architectures
The model builder currently supports the following architectures:
AMD OLMo
ChatGLM
DeepSeek
ERNIE 4.5
Gemma
gpt-oss
Granite
InternLM2
Llama
Mistral
Nemotron
Phi
Qwen
SmolLM3
Installation
The model builder is included in the ONNX Runtime GenAI Python package:
pip install onnxruntime-genai
Basic Usage
View all available options:
python -m onnxruntime_genai.models.builder --help
Converting Models
PyTorch Model from Hugging Face
Convert a model directly from Hugging Face:
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p precision \
-e execution_provider \
-c cache_dir_to_save_hf_files
Parameters:
-m: Model name from Hugging Face
-o: Output directory for the ONNX model
-p: Precision (fp16, fp32, int4, etc.)
-e: Execution provider (cpu, cuda, dml, etc.)
-c: Cache directory for Hugging Face files
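If you drive the builder from a script rather than the shell, the same flags can be assembled programmatically. A minimal sketch, assuming you run the resulting command with subprocess (the model name and paths here are placeholders, not recommendations):

```python
import sys

def builder_command(model, output_dir, precision="int4", provider="cpu", cache_dir=None):
    # Assemble the CLI invocation shown above as an argument list;
    # execute it with subprocess.run(cmd, check=True).
    cmd = [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", model,
        "-o", output_dir,
        "-p", precision,
        "-e", provider,
    ]
    if cache_dir:
        cmd += ["-c", cache_dir]
    return cmd

# Placeholder model and paths:
cmd = builder_command("microsoft/Phi-3.5-mini-instruct", "phi35-int4", cache_dir="hf_cache")
print(" ".join(cmd))
```

Building the argument list this way avoids shell-quoting issues when paths contain spaces.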
PyTorch Model from Disk
Convert a PyTorch model that has already been downloaded to your Hugging Face cache (the builder reads the files from the cache directory instead of downloading them again):
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p precision \
-e execution_provider \
-c cache_dir_where_hf_files_are_saved
Customized or Finetuned Model
Convert your custom or finetuned PyTorch model:
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p precision \
-e execution_provider \
-c cache_dir_to_store_temp_files
GGUF Model
Convert a GGUF model to ONNX format:
python -m onnxruntime_genai.models.builder \
-m model_name \
-i path_to_gguf_file \
-o path_to_output_folder \
-p precision \
-e execution_provider \
-c cache_dir_for_hf_files
Quantization Options
INT4 Quantization
Convert a pre-quantized INT4 model (AutoGPTQ or AutoAWQ):
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p int4 \
-e execution_provider \
-c cache_dir_to_store_temp_files
Shared Embeddings
Enable weight sharing between embedding layer and language modeling head to reduce model size:
This option can be combined with several precision variants: INT4 with K-quant, INT4 with INT8 embeddings, or FP16 embeddings. For example, INT4 with K-quant:
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p int4 \
-e cuda \
--extra_options shared_embeddings=true int4_algo_config=k_quant
Shared embeddings are enabled automatically when tie_word_embeddings=true in the model's config.json. They cannot be combined with exclude_embeds=true or exclude_lm_head=true.
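To check ahead of time whether a model ties its embeddings, you can inspect its Hugging Face config.json. A small sketch (the helper name and directory layout are illustrative):

```python
import json
from pathlib import Path

def ties_word_embeddings(model_dir):
    # Shared embeddings are enabled automatically when the Hugging Face
    # config ties the embedding layer to the LM head.
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # A missing key is treated as False, matching the Transformers default
    # for most decoder-only models.
    return bool(config.get("tie_word_embeddings", False))
```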
QDQ Pattern Quantization
Use the QDQ (Quantize-Dequantize) pattern for 4-bit quantization:
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p int4 \
-e execution_provider \
--extra_options use_qdq=true
Advanced Options
Config Only Mode
Generate only the configuration files for an existing ONNX model:
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p precision \
-e execution_provider \
-c cache_dir_for_hf_files \
--extra_options config_only=true
After running this, modify the genai_config.json file in the output folder as needed.
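Those edits can also be scripted. A minimal sketch, assuming the generated genai_config.json has its usual top-level "search" section (the helper name and path are placeholders):

```python
import json
from pathlib import Path

def update_search_options(config_path, **options):
    # Rewrite selected entries in the "search" section of genai_config.json
    # (e.g. max_length, temperature) and save the file in place.
    path = Path(config_path)
    config = json.loads(path.read_text())
    config.setdefault("search", {}).update(options)
    path.write_text(json.dumps(config, indent=4))
    return config

# Example (placeholder path):
# update_search_options("path_to_output_folder/genai_config.json", max_length=2048)
```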
Exclude Components
Exclude specific model components:
Three variants are available: exclude the embedding layer (exclude_embeds=true), exclude the language modeling head (exclude_lm_head=true), or prune the LM head. For example, excluding the embeddings:
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p precision \
-e execution_provider \
--extra_options exclude_embeds=true
Include Hidden States
Include last hidden states as model output:
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p precision \
-e execution_provider \
--extra_options include_hidden_states=true
The last hidden states are also known as embeddings.
CUDA Graph Support
Enable CUDA graph optimization:
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p precision \
-e cuda \
--extra_options enable_cuda_graph=true
Disable QKV Fusion
Keep Q/K/V projections separate instead of fusing them:
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p precision \
-e execution_provider \
--extra_options disable_qkv_fusion=true
LoRA Adapter Support
Convert models with LoRA adapters using PEFT:
python -m onnxruntime_genai.models.builder \
-i path_to_local_folder_on_disk \
-o path_to_output_folder \
-p fp16 \
-e execution_provider \
-c cache_dir_to_store_temp_files \
--extra_options adapter_path=path_to_adapter_files
The base model weights should be in path_to_local_folder_on_disk, and the adapter weights in path_to_adapter_files.
See the Multi-LoRA guide for runtime usage.
Testing Models
Create a model with reduced layers for testing:
Option 1: Direct Builder Command
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p precision \
-e execution_provider \
--extra_options num_hidden_layers=4
Option 2: Edit config.json
Locate the Model Files
Navigate to where the PyTorch model is saved on disk.
Edit config.json
Modify num_hidden_layers in config.json to your desired value (e.g., 4 layers).
Run the Builder
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p precision \
-e execution_provider \
-c cache_dir_where_hf_files_are_saved
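The config.json edit in Option 2 can be scripted so it is repeatable across test runs. A minimal sketch (the helper name and directory path are placeholders):

```python
import json
from pathlib import Path

def truncate_layers(model_dir, num_layers=4):
    # Shrink num_hidden_layers in the model's config.json so the builder
    # exports a small model that loads quickly in smoke tests.
    path = Path(model_dir) / "config.json"
    config = json.loads(path.read_text())
    config["num_hidden_layers"] = num_layers
    path.write_text(json.dumps(config, indent=2))
    return config
```

Remember that a truncated model produces meaningless text; it is only useful for verifying that the export and runtime pipeline work end to end.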
Hugging Face Configuration
Custom Authentication
Disable or use a different Hugging Face token:
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p precision \
-e execution_provider \
--extra_options hf_token=false
Remote Code Trust
Disable trusting remote code from Hugging Face:
python -m onnxruntime_genai.models.builder \
-m model_name \
-o path_to_output_folder \
-p precision \
-e execution_provider \
--extra_options hf_remote=false
Next Steps
Download Models Learn about other ways to obtain models
Runtime Options Configure your model at runtime
Multi-LoRA Use multiple LoRA adapters dynamically
Quickstart Run your first inference