Multi-LoRA
ONNX Runtime GenAI supports Multi-LoRA, allowing you to dynamically load, manage, and switch between multiple LoRA (Low-Rank Adaptation) adapters at runtime without reloading the base model.

Overview

Multi-LoRA support enables:
  • Dynamic adapter loading: Load adapters on-demand without restarting
  • Efficient memory usage: Share the base model across multiple adapters
  • Adapter switching: Change adapters between generations
  • Reference counting: Automatically manage adapter lifecycle

Use Cases

Multi-Tenant Serving

Serve different fine-tuned models to different users while sharing the base model

Task-Specific Adaptation

Switch between adapters optimized for different tasks (summarization, translation, etc.)

A/B Testing

Test different adapter versions without infrastructure changes

Personalization

Provide personalized model behavior per user or session

Preparing LoRA Adapters

First, create your LoRA adapters using the Model Builder:
python -m onnxruntime_genai.models.builder \
  -i path_to_base_model \
  -o path_to_output_folder \
  -p fp16 \
  -e cuda \
  -c cache_dir \
  --extra_options adapter_path=path_to_lora_weights
  • Base model weights should be in path_to_base_model
  • LoRA adapter weights should be in path_to_lora_weights
  • The adapter must be compatible with the base model architecture
See the Model Builder guide for more details.

Using Multi-LoRA at Runtime

Python Example

Here’s a complete example showing how to use multiple LoRA adapters:
import onnxruntime_genai as og

# Load the base model and tokenizer
model = og.Model('path/to/base/model')
tokenizer = og.Tokenizer(model)

# Create the Adapters manager
adapters = og.Adapters(model)

# Load multiple LoRA adapters
adapters.load('path/to/adapter1/adapter_weights.onnx', 'summarization')
adapters.load('path/to/adapter2/adapter_weights.onnx', 'translation')
adapters.load('path/to/adapter3/adapter_weights.onnx', 'coding')

# Set up generation parameters
params = og.GeneratorParams(model)
params.set_search_options(max_length=200)

# Create a generator and activate the summarization adapter
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, 'summarization')

# Encode the input and generate
prompt = "Summarize this article: ..."
input_tokens = tokenizer.encode(prompt)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()

summary = tokenizer.decode(generator.get_sequence(0))
print(f"Summary: {summary}")

# Switch to a different adapter for the next generation
generator2 = og.Generator(model, params)
generator2.set_active_adapter(adapters, 'translation')

# Generate with the translation adapter
translation_prompt = "Translate to French: Hello, how are you?"
input_tokens = tokenizer.encode(translation_prompt)
generator2.append_tokens(input_tokens)

while not generator2.is_done():
    generator2.generate_next_token()

translation = tokenizer.decode(generator2.get_sequence(0))
print(f"Translation: {translation}")

# Adapters can be unloaded once no generator is still using them;
# see UnloadAdapter in the API Reference below.
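
A common pattern is to wrap this per-request flow in a small helper so each request simply names the adapter it needs. The sketch below is illustrative (the helper itself is not part of the library) and uses only the calls shown in the example above:

def generate_with_adapter(model, tokenizer, adapters, adapter_name, prompt, max_length=200):
    """Run a single generation with the named adapter active (illustrative helper)."""
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.set_active_adapter(adapters, adapter_name)

    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))

# Each request shares the base model and only switches the active adapter
print(generate_with_adapter(model, tokenizer, adapters, 'coding',
                            "Write a Python function that reverses a string."))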

C++ Example

#include "ort_genai.h"
#include <iostream>

int main() {
    // Load the base model and tokenizer
    auto model = OgaModel::Create("path/to/base/model");
    auto tokenizer = OgaTokenizer::Create(*model);

    // Create the Adapters manager
    auto adapters = OgaAdapters::Create(*model);

    // Load LoRA adapters
    adapters->LoadAdapter("path/to/adapter1/adapter_weights.onnx", "summarization");
    adapters->LoadAdapter("path/to/adapter2/adapter_weights.onnx", "translation");

    // Create generator params
    auto params = OgaGeneratorParams::Create(*model);
    params->SetSearchOption("max_length", 200);

    // Create a generator and activate an adapter
    auto generator = OgaGenerator::Create(*model, *params);
    generator->SetActiveAdapter(*adapters, "summarization");

    // Encode the prompt and generate
    const char* prompt = "Summarize this article: ...";
    auto input_sequences = OgaSequences::Create();
    tokenizer->Encode(prompt, *input_sequences);
    generator->AppendTokenSequences(*input_sequences);

    while (!generator->IsDone()) {
        generator->GenerateNextToken();
    }

    // Decode the generated sequence
    const int32_t* output_tokens = generator->GetSequenceData(0);
    size_t output_length = generator->GetSequenceCount(0);
    auto output_text = tokenizer->Decode(output_tokens, output_length);
    std::cout << "Summary: " << output_text << std::endl;

    // Unload the adapter once it is no longer in use
    adapters->UnloadAdapter("summarization");

    return 0;
}

C# Example

using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// Load the base model and tokenizer
using var model = new Model("path/to/base/model");
using var tokenizer = new Tokenizer(model);

// Create the Adapters manager
using var adapters = new Adapters(model);

// Load LoRA adapters
adapters.LoadAdapter("path/to/adapter1/adapter_weights.onnx", "summarization");
adapters.LoadAdapter("path/to/adapter2/adapter_weights.onnx", "translation");

// Create generator params
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 200);

// Create a generator and activate an adapter
using var generator = new Generator(model, generatorParams);
generator.SetActiveAdapter(adapters, "summarization");

// Encode the prompt and generate
var prompt = "Summarize this article: ...";
using var inputSequences = tokenizer.Encode(prompt);
generator.AppendTokenSequences(inputSequences);

while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

var summary = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine($"Summary: {summary}");

// Unload the adapter once it is no longer in use
adapters.UnloadAdapter("summarization");

API Reference

Adapters Class

Create
Creates an Adapters manager instance for the given model.
Parameters:
  • model: The base model to manage adapters for
Returns: Adapters instance

LoadAdapter
Loads a LoRA adapter from disk.
Parameters:
  • adapter_file_path: Path to the adapter weights file
  • adapter_name: Unique identifier for this adapter
Throws: Error if the adapter name already exists

UnloadAdapter
Unloads a previously loaded adapter.
Parameters:
  • adapter_name: Name of the adapter to unload
Throws:
  • Error if the adapter is not found
  • Error if the adapter is still in use (reference count > 0)

Generator Methods

SetActiveAdapter
Sets the active LoRA adapter for this generator.
Parameters:
  • adapters: The Adapters manager instance
  • adapter_name: Name of the adapter to activate
Throws: Error if the adapter is not found
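
The sketch below illustrates how these error conditions can be handled from Python. It assumes errors surface as ordinary Python exceptions (the exact exception type is not specified here, so a broad except is used); the helper name is illustrative:

import onnxruntime_genai as og

model = og.Model('path/to/base/model')
adapters = og.Adapters(model)

def load_adapter_once(adapters, path, name):
    # Tolerate the case where an adapter with this name is already loaded
    try:
        adapters.load(path, name)
    except Exception as err:  # exact exception type depends on the bindings
        print(f"Could not load adapter '{name}': {err}")

load_adapter_once(adapters, 'path/to/adapter1/adapter_weights.onnx', 'summarization')

# Activating an unknown adapter name raises an error (lookup is case-sensitive)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
try:
    generator.set_active_adapter(adapters, 'Summarization')  # wrong case: adapter not found
except Exception as err:
    print(f"Adapter not found: {err}")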

Best Practices

Adapter lifecycle
  • Load adapters at application startup for better performance
  • Unload adapters only when they’re no longer needed across all sessions
  • The library uses reference counting to prevent unloading adapters that are in use

Adapter naming
Use descriptive, consistent names for your adapters:
  • Task-based: “summarization”, “translation”, “code-generation”
  • User-based: “user_123”, “tenant_abc”
  • Version-based: “summarization_v1”, “summarization_v2”

Memory
  • Each adapter adds memory overhead (typically small compared to the base model)
  • Monitor memory usage when loading many adapters
  • Consider lazy-loading adapters on-demand for large deployments (see the sketch below)

Compatibility
  • Ensure adapters are created from the same base model
  • Use consistent precision (fp16, fp32) across the base model and adapters
  • Verify the adapter architecture matches the base model
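
As mentioned in the memory bullets above, large deployments can lazy-load adapters on demand instead of loading everything at startup. A minimal sketch, assuming the Python Adapters API from the example above (the registry class itself is illustrative):

class LazyAdapterRegistry:
    """Loads adapters on first use from a name -> adapter file mapping (illustrative)."""

    def __init__(self, adapters, adapter_files):
        self._adapters = adapters            # og.Adapters instance
        self._adapter_files = adapter_files  # e.g. {'summarization': 'path/to/adapter1/adapter_weights.onnx'}
        self._loaded = set()

    def ensure_loaded(self, name):
        if name not in self._loaded:
            self._adapters.load(self._adapter_files[name], name)
            self._loaded.add(name)
        return name

registry = LazyAdapterRegistry(adapters, {
    'summarization': 'path/to/adapter1/adapter_weights.onnx',
    'translation': 'path/to/adapter2/adapter_weights.onnx',
})

# The adapter file is read only on the first request that asks for it
generator.set_active_adapter(adapters, registry.ensure_loaded('translation'))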

Performance Tips

1. Pre-load Common Adapters: Load frequently-used adapters at startup to avoid latency during inference.

2. Reuse Generator Instances: When possible, reuse generator instances and just switch adapters rather than creating new generators.

3. Batch Similar Requests: Group requests that use the same adapter together to minimize adapter switching overhead (see the sketch after this list).

4. Monitor Reference Counts: Keep track of which adapters are in use to optimize when to load and unload them.
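
To illustrate tip 3, incoming requests can be grouped by adapter name before generation so the active adapter changes once per group rather than once per request. The grouping is plain Python; generate_with_adapter refers to the illustrative helper sketched after the Python example:

from collections import defaultdict

requests = [
    ('translation', "Translate to French: Good morning."),
    ('summarization', "Summarize this article: ..."),
    ('translation', "Translate to French: See you soon."),
]

# Group prompts by the adapter they need
by_adapter = defaultdict(list)
for adapter_name, prompt in requests:
    by_adapter[adapter_name].append(prompt)

# Process one adapter's requests together to minimize switching overhead
for adapter_name, prompts in by_adapter.items():
    for prompt in prompts:
        print(generate_with_adapter(model, tokenizer, adapters, adapter_name, prompt))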

Troubleshooting

“Adapter still in use” error when unloading:
This occurs when trying to unload an adapter that has active references. Ensure all generators using this adapter have completed or been destroyed.

“Adapter not found” error:
  • Verify the adapter name is spelled correctly (case-sensitive)
  • Ensure the adapter was successfully loaded before attempting to use it
  • Check that the adapter hasn’t been unloaded

Memory issues with many adapters:
  • Limit the number of simultaneously loaded adapters
  • Implement an LRU cache to automatically unload least-used adapters (see the sketch below)
  • Monitor system memory and adapter usage patterns
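
The LRU suggestion above can be sketched as a small cache around adapter loading. This is illustrative only: it assumes an unload method mirroring the UnloadAdapter operation in the API Reference (check the exact method name exposed by your language binding), and eviction fails while the evicted adapter is still in use by a live generator:

from collections import OrderedDict

class LRUAdapterCache:
    """Keeps at most `capacity` adapters loaded, evicting the least recently used (illustrative)."""

    def __init__(self, adapters, adapter_files, capacity=4):
        self._adapters = adapters
        self._adapter_files = adapter_files  # name -> adapter file path
        self._capacity = capacity
        self._recency = OrderedDict()        # oldest entries first

    def acquire(self, name):
        if name not in self._recency:
            self._adapters.load(self._adapter_files[name], name)
        self._recency[name] = None
        self._recency.move_to_end(name)
        while len(self._recency) > self._capacity:
            evicted, _ = self._recency.popitem(last=False)
            # Assumed unload call (see UnloadAdapter in the API Reference above);
            # raises if the adapter is still referenced by an active generator.
            self._adapters.unload(evicted)
        return name

cache = LRUAdapterCache(adapters, {
    'summarization': 'path/to/adapter1/adapter_weights.onnx',
    'translation': 'path/to/adapter2/adapter_weights.onnx',
}, capacity=2)

generator.set_active_adapter(adapters, cache.acquire('summarization'))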

Next Steps

Model Builder

Learn how to create LoRA adapters

Runtime Options

Configure additional runtime settings

Python API

Explore the Adapters API reference

Examples

View complete examples on GitHub
