Multi-LoRA
ONNX Runtime GenAI supports Multi-LoRA, allowing you to dynamically load, manage, and switch between multiple LoRA (Low-Rank Adaptation) adapters at runtime without reloading the base model.

Overview

Multi-LoRA support enables:
  • Dynamic adapter loading: Load adapters on-demand without restarting
  • Efficient memory usage: Share the base model across multiple adapters
  • Adapter switching: Change adapters between generations
  • Reference counting: Automatically manage adapter lifecycle

Use Cases

Multi-Tenant Serving

Serve different fine-tuned models to different users while sharing the base model

Task-Specific Adaptation

Switch between adapters optimized for different tasks (summarization, translation, etc.)

A/B Testing

Test different adapter versions without infrastructure changes

Personalization

Provide personalized model behavior per user or session

Preparing LoRA Adapters

First, create your LoRA adapters using the Model Builder:
python -m onnxruntime_genai.models.builder \
  -i path_to_base_model \
  -o path_to_output_folder \
  -p fp16 \
  -e cuda \
  -c cache_dir \
  --extra_options adapter_path=path_to_lora_weights
  • Base model weights should be in path_to_base_model
  • LoRA adapter weights should be in path_to_lora_weights
  • The adapter must be compatible with the base model architecture
See the Model Builder guide for more details.

Using Multi-LoRA at Runtime

Python Example

Here’s a complete example showing how to use multiple LoRA adapters:
import onnxruntime_genai as og

# Load the base model and tokenizer
model = og.Model('path/to/base/model')
tokenizer = og.Tokenizer(model)

# Create the Adapters manager
adapters = og.Adapters(model)

# Load multiple LoRA adapters
adapters.load('path/to/adapter1/adapter_weights.onnx', 'summarization')
adapters.load('path/to/adapter2/adapter_weights.onnx', 'translation')
adapters.load('path/to/adapter3/adapter_weights.onnx', 'coding')

# Set up generation parameters
params = og.GeneratorParams(model)
params.set_search_options(max_length=200)

# Create a generator and activate the summarization adapter
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, 'summarization')

# Encode the input and generate
prompt = "Summarize this article: ..."
input_tokens = tokenizer.encode(prompt)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()

summary = tokenizer.decode(generator.get_sequence(0))
print(f"Summary: {summary}")

# Switch to a different adapter for the next generation
generator2 = og.Generator(model, params)
generator2.set_active_adapter(adapters, 'translation')

# Generate with the translation adapter
translation_prompt = "Translate to French: Hello, how are you?"
input_tokens = tokenizer.encode(translation_prompt)
generator2.append_tokens(input_tokens)

while not generator2.is_done():
    generator2.generate_next_token()

translation = tokenizer.decode(generator2.get_sequence(0))
print(f"Translation: {translation}")

# Adapters can be unloaded once no generator is still using them;
# see UnloadAdapter in the API Reference below.
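
A common pattern is to wrap this per-request flow in a small helper so each request simply names the adapter it needs. The sketch below is illustrative (the helper itself is not part of the library) and uses only the calls shown in the example above:

def generate_with_adapter(model, tokenizer, adapters, adapter_name, prompt, max_length=200):
    """Run a single generation with the named adapter active (illustrative helper)."""
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.set_active_adapter(adapters, adapter_name)

    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))

# Each request shares the base model and only switches the active adapter
print(generate_with_adapter(model, tokenizer, adapters, 'coding',
                            "Write a Python function that reverses a string."))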

C++ Example

#include "ort_genai.h"
#include <iostream>

int main() {
    // Load the base model and tokenizer
    auto model = OgaModel::Create("path/to/base/model");
    auto tokenizer = OgaTokenizer::Create(*model);

    // Create the Adapters manager
    auto adapters = OgaAdapters::Create(*model);

    // Load LoRA adapters
    adapters->LoadAdapter("path/to/adapter1/adapter_weights.onnx", "summarization");
    adapters->LoadAdapter("path/to/adapter2/adapter_weights.onnx", "translation");

    // Create generator params
    auto params = OgaGeneratorParams::Create(*model);
    params->SetSearchOption("max_length", 200);

    // Create a generator and activate an adapter
    auto generator = OgaGenerator::Create(*model, *params);
    generator->SetActiveAdapter(*adapters, "summarization");

    // Encode the prompt and generate
    const char* prompt = "Summarize this article: ...";
    auto input_sequences = OgaSequences::Create();
    tokenizer->Encode(prompt, *input_sequences);
    generator->AppendTokenSequences(*input_sequences);

    while (!generator->IsDone()) {
        generator->GenerateNextToken();
    }

    // Decode the generated sequence
    const int32_t* output_tokens = generator->GetSequenceData(0);
    size_t output_length = generator->GetSequenceCount(0);
    auto output_text = tokenizer->Decode(output_tokens, output_length);
    std::cout << "Summary: " << output_text << std::endl;

    // Unload the adapter once it is no longer in use
    adapters->UnloadAdapter("summarization");

    return 0;
}

C# Example

using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// Load the base model and tokenizer
using var model = new Model("path/to/base/model");
using var tokenizer = new Tokenizer(model);

// Create the Adapters manager
using var adapters = new Adapters(model);

// Load LoRA adapters
adapters.LoadAdapter("path/to/adapter1/adapter_weights.onnx", "summarization");
adapters.LoadAdapter("path/to/adapter2/adapter_weights.onnx", "translation");

// Create generator params
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 200);

// Create a generator and activate an adapter
using var generator = new Generator(model, generatorParams);
generator.SetActiveAdapter(adapters, "summarization");

// Encode the prompt and generate
var prompt = "Summarize this article: ...";
using var inputSequences = tokenizer.Encode(prompt);
generator.AppendTokenSequences(inputSequences);

while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

var summary = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine($"Summary: {summary}");

// Unload the adapter once it is no longer in use
adapters.UnloadAdapter("summarization");

API Reference

Adapters Class

Create
Creates an Adapters manager instance for the given model.
Parameters:
  • model: The base model to manage adapters for
Returns: Adapters instance

LoadAdapter
Loads a LoRA adapter from disk.
Parameters:
  • adapter_file_path: Path to the adapter weights file
  • adapter_name: Unique identifier for this adapter
Throws: Error if the adapter name already exists

UnloadAdapter
Unloads a previously loaded adapter.
Parameters:
  • adapter_name: Name of the adapter to unload
Throws:
  • Error if the adapter is not found
  • Error if the adapter is still in use (reference count > 0)

Generator Methods

SetActiveAdapter
Sets the active LoRA adapter for this generator.
Parameters:
  • adapters: The Adapters manager instance
  • adapter_name: Name of the adapter to activate
Throws: Error if the adapter is not found
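
The sketch below illustrates how these error conditions can be handled from Python. It assumes errors surface as ordinary Python exceptions (the exact exception type is not specified here, so a broad except is used); the helper name is illustrative:

import onnxruntime_genai as og

model = og.Model('path/to/base/model')
adapters = og.Adapters(model)

def load_adapter_once(adapters, path, name):
    # Tolerate the case where an adapter with this name is already loaded
    try:
        adapters.load(path, name)
    except Exception as err:  # exact exception type depends on the bindings
        print(f"Could not load adapter '{name}': {err}")

load_adapter_once(adapters, 'path/to/adapter1/adapter_weights.onnx', 'summarization')

# Activating an unknown adapter name raises an error (lookup is case-sensitive)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
try:
    generator.set_active_adapter(adapters, 'Summarization')  # wrong case: adapter not found
except Exception as err:
    print(f"Adapter not found: {err}")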

Best Practices

Adapter lifecycle
  • Load adapters at application startup for better performance
  • Unload adapters only when they’re no longer needed across all sessions
  • The library uses reference counting to prevent unloading adapters that are in use

Adapter naming
Use descriptive, consistent names for your adapters:
  • Task-based: “summarization”, “translation”, “code-generation”
  • User-based: “user_123”, “tenant_abc”
  • Version-based: “summarization_v1”, “summarization_v2”

Memory
  • Each adapter adds memory overhead (typically small compared to the base model)
  • Monitor memory usage when loading many adapters
  • Consider lazy-loading adapters on-demand for large deployments (see the sketch below)

Compatibility
  • Ensure adapters are created from the same base model
  • Use consistent precision (fp16, fp32) across the base model and adapters
  • Verify the adapter architecture matches the base model
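
As mentioned in the memory bullets above, large deployments can lazy-load adapters on demand instead of loading everything at startup. A minimal sketch, assuming the Python Adapters API from the example above (the registry class itself is illustrative):

class LazyAdapterRegistry:
    """Loads adapters on first use from a name -> adapter file mapping (illustrative)."""

    def __init__(self, adapters, adapter_files):
        self._adapters = adapters            # og.Adapters instance
        self._adapter_files = adapter_files  # e.g. {'summarization': 'path/to/adapter1/adapter_weights.onnx'}
        self._loaded = set()

    def ensure_loaded(self, name):
        if name not in self._loaded:
            self._adapters.load(self._adapter_files[name], name)
            self._loaded.add(name)
        return name

registry = LazyAdapterRegistry(adapters, {
    'summarization': 'path/to/adapter1/adapter_weights.onnx',
    'translation': 'path/to/adapter2/adapter_weights.onnx',
})

# The adapter file is read only on the first request that asks for it
generator.set_active_adapter(adapters, registry.ensure_loaded('translation'))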

Performance Tips

1. Pre-load Common Adapters: Load frequently-used adapters at startup to avoid latency during inference.

2. Reuse Generator Instances: When possible, reuse generator instances and just switch adapters rather than creating new generators.

3. Batch Similar Requests: Group requests that use the same adapter together to minimize adapter switching overhead (see the sketch after this list).

4. Monitor Reference Counts: Keep track of which adapters are in use to optimize when to load and unload them.
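
To illustrate tip 3, incoming requests can be grouped by adapter name before generation so the active adapter changes once per group rather than once per request. The grouping is plain Python; generate_with_adapter refers to the illustrative helper sketched after the Python example:

from collections import defaultdict

requests = [
    ('translation', "Translate to French: Good morning."),
    ('summarization', "Summarize this article: ..."),
    ('translation', "Translate to French: See you soon."),
]

# Group prompts by the adapter they need
by_adapter = defaultdict(list)
for adapter_name, prompt in requests:
    by_adapter[adapter_name].append(prompt)

# Process one adapter's requests together to minimize switching overhead
for adapter_name, prompts in by_adapter.items():
    for prompt in prompts:
        print(generate_with_adapter(model, tokenizer, adapters, adapter_name, prompt))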

Troubleshooting

“Adapter still in use” error when unloading:
This occurs when trying to unload an adapter that has active references. Ensure all generators using this adapter have completed or been destroyed.

“Adapter not found” error:
  • Verify the adapter name is spelled correctly (case-sensitive)
  • Ensure the adapter was successfully loaded before attempting to use it
  • Check that the adapter hasn’t been unloaded

Memory issues with many adapters:
  • Limit the number of simultaneously loaded adapters
  • Implement an LRU cache to automatically unload least-used adapters (see the sketch below)
  • Monitor system memory and adapter usage patterns
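
The LRU suggestion above can be sketched as a small cache around adapter loading. This is illustrative only: it assumes an unload method mirroring the UnloadAdapter operation in the API Reference (check the exact method name exposed by your language binding), and eviction fails while the evicted adapter is still in use by a live generator:

from collections import OrderedDict

class LRUAdapterCache:
    """Keeps at most `capacity` adapters loaded, evicting the least recently used (illustrative)."""

    def __init__(self, adapters, adapter_files, capacity=4):
        self._adapters = adapters
        self._adapter_files = adapter_files  # name -> adapter file path
        self._capacity = capacity
        self._recency = OrderedDict()        # oldest entries first

    def acquire(self, name):
        if name not in self._recency:
            self._adapters.load(self._adapter_files[name], name)
        self._recency[name] = None
        self._recency.move_to_end(name)
        while len(self._recency) > self._capacity:
            evicted, _ = self._recency.popitem(last=False)
            # Assumed unload call (see UnloadAdapter in the API Reference above);
            # raises if the adapter is still referenced by an active generator.
            self._adapters.unload(evicted)
        return name

cache = LRUAdapterCache(adapters, {
    'summarization': 'path/to/adapter1/adapter_weights.onnx',
    'translation': 'path/to/adapter2/adapter_weights.onnx',
}, capacity=2)

generator.set_active_adapter(adapters, cache.acquire('summarization'))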

Next Steps

Model Builder

Learn how to create LoRA adapters

Runtime Options

Configure additional runtime settings

Python API

Explore the Adapters API reference

Examples

View complete examples on GitHub
