LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts large language models to specific tasks without modifying the original model weights. Instead of fine-tuning all parameters, LoRA trains small low-rank decomposition matrices whose product is added to the frozen weights at inference time.

What is LoRA?

LoRA decomposes weight updates into low-rank matrices:
W' = W + BA
Where:
  • W is the original pre-trained weight matrix
  • B and A are low-rank matrices (rank r << min(d_in, d_out))
  • Only B and A are trained and stored (massive parameter reduction)
For a 7B parameter model, a LoRA adapter with rank 8 typically adds only ~8-16M parameters (0.1-0.2% of original model size).
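The parameter reduction is easy to verify with a quick back-of-envelope calculation (plain Python; the 4096x4096 projection shape is an illustrative assumption, not a specific model's):

```python
# For one d_out x d_in weight matrix, a rank-r LoRA update trains
# B (d_out x r) and A (r x d_in) instead of the full d_out x d_in matrix.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return d_out * r + r * d_in

full = 4096 * 4096                 # full update for one 4096x4096 projection
lora = lora_params(4096, 4096, r=8)
print(lora, f"{lora / full:.2%}")  # 65536 trainable params, ~0.39% of the matrix
```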

Quick Start

Single LoRA Adapter

from tensorrt_llm import LLM
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.sampling_params import SamplingParams

# Configure LoRA
lora_config = LoraConfig(
    lora_dir=["/path/to/lora/adapter"],
    max_lora_rank=8,
    max_loras=1,
    max_cpu_loras=1
)

# Initialize LLM with LoRA support
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    lora_config=lora_config
)

# Create LoRA request
lora_request = LoRARequest("my-lora-task", 0, "/path/to/lora/adapter")

# Generate with LoRA
prompts = ["Translate to French: Hello, how are you?"]
sampling_params = SamplingParams(max_tokens=50)

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=[lora_request]
)

for output in outputs:
    print(output.outputs[0].text)

Multi-LoRA Support

Serve multiple LoRA adapters simultaneously:
from tensorrt_llm import LLM
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.sampling_params import SamplingParams

# Configure for multiple LoRA adapters
lora_config = LoraConfig(
    lora_target_modules=['attn_q', 'attn_k', 'attn_v'],
    max_lora_rank=8,
    max_loras=4,        # Up to 4 LoRAs active in GPU simultaneously
    max_cpu_loras=8     # Up to 8 LoRAs cached in CPU memory
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    lora_config=lora_config
)

# Create multiple LoRA requests
lora_req1 = LoRARequest("translation", 0, "/path/to/translation_adapter")
lora_req2 = LoRARequest("summarization", 1, "/path/to/summarization_adapter")

prompts = [
    "Translate to French: Hello world",
    "Summarize: This is a long document about AI..."
]

# Apply different LoRAs to different prompts
sampling_params = SamplingParams(max_tokens=50)
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=[lora_req1, lora_req2]
)

Configuration Options

LoraConfig Parameters

Parameter                     Type       Description
lora_dir                      List[str]  Paths to LoRA adapter directories
lora_target_modules           List[str]  Modules to apply LoRA to (e.g., ['attn_q', 'attn_k', 'attn_v'])
max_lora_rank                 int        Maximum rank of LoRA adapters
max_loras                     int        Maximum number of LoRAs active on GPU simultaneously
max_cpu_loras                 int        Maximum number of LoRAs cached in CPU memory
lora_ckpt_source              str        Format of LoRA checkpoint: "hf" (HuggingFace) or "nemo" (NeMo)
trtllm_modules_to_hf_modules  Dict       Mapping from TRT-LLM module names to HuggingFace module names
max_cpu_loras should be >= max_loras. The system maintains a cache in CPU memory and swaps LoRAs to GPU as needed.
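The swap behavior can be modeled with a small LRU sketch (pure Python, illustrative only; this is not the actual PEFT cache implementation):

```python
from collections import OrderedDict

class LoraGpuCache:
    """Toy LRU model of the GPU-side LoRA slots (max_loras)."""
    def __init__(self, max_loras: int):
        self.max_loras = max_loras
        self.slots = OrderedDict()  # adapter id -> loaded weights (stubbed)

    def activate(self, adapter_id: str) -> str:
        if adapter_id in self.slots:
            self.slots.move_to_end(adapter_id)  # already resident: cache hit
            return "hit"
        if len(self.slots) >= self.max_loras:
            self.slots.popitem(last=False)      # evict least recently used
        self.slots[adapter_id] = object()       # stand-in for a CPU->GPU copy
        return "miss"

cache = LoraGpuCache(max_loras=2)
print([cache.activate(a) for a in ["t", "s", "t", "q", "s"]])
# -> ['miss', 'miss', 'hit', 'miss', 'miss']
```

With max_loras=2, the third request for "t" hits, and "q" forces the least recently used adapter ("s") back out, so "s" misses again.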

Advanced Usage

LoRA with Quantization

LoRA works seamlessly with quantized models:
from tensorrt_llm import LLM
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization.mode import QuantAlgo

# Configure FP8 quantization
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8
)

# Configure LoRA
lora_config = LoraConfig(
    lora_dir=["/path/to/adapter"],
    max_lora_rank=8,
    max_loras=1
)

# LoRA works with quantized models
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quant_config=quant_config,
    lora_config=lora_config
)
LoRA adapters are applied in full precision (FP16/BF16) even when the base model is quantized. This preserves adapter quality while maintaining memory savings from quantization.
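A toy numeric sketch of that note (pure Python, not TensorRT-LLM internals; coarse rounding stands in for FP8/INT4 storage):

```python
def fake_quantize(w: float, step: float = 0.25) -> float:
    # stand-in for low-precision storage: snap the weight to a coarse grid
    return round(w / step) * step

w = 0.137             # base weight, stored quantized
lora_delta = 0.01234  # B @ A contribution, kept in full precision
w_effective = fake_quantize(w) + lora_delta
print(fake_quantize(w), w_effective)
```

The base weight loses precision to quantization, but the adapter's small correction is added at full resolution, which is why adapter quality survives base-model quantization.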

NeMo LoRA Format

Support for NeMo-format LoRA checkpoints:
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.executor.request import LoRARequest

# Configure for NeMo format
lora_config = LoraConfig(
    lora_dir=["/path/to/nemo/lora"],
    lora_ckpt_source="nemo",
    max_lora_rank=8
)

lora_request = LoRARequest(
    "nemo-task",
    0,
    "/path/to/nemo/lora",
    lora_ckpt_source="nemo"
)

Cache Management

Fine-tune LoRA cache sizes for optimal performance:
from tensorrt_llm.llmapi.llm_args import PeftCacheConfig

# Customize cache sizes
peft_cache_config = PeftCacheConfig(
    host_cache_size=1024*1024*1024,  # 1GB CPU cache for LoRA weights
    device_cache_percent=0.1          # Use 10% of free GPU memory for LoRA cache
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    lora_config=lora_config,
    peft_cache_config=peft_cache_config
)
host_cache_size controls the CPU memory allocated for caching inactive LoRA adapters. Larger values allow more adapters to be cached, reducing load time when switching between adapters.
device_cache_percent is the percentage of free GPU memory dedicated to the LoRA adapter cache. Higher values allow more adapters to be active simultaneously but reduce the memory available for the KV cache.
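When picking host_cache_size, a rough per-adapter size estimate helps (the dims below are assumptions loosely matching an 8B Llama-style model; under GQA the attn_k/attn_v projections are smaller, so this overestimates slightly):

```python
def adapter_bytes(layers: int, hidden: int, rank: int,
                  targets: int = 3, bytes_per_param: int = 2) -> int:
    # per target module: A (rank x hidden) + B (hidden x rank) in FP16/BF16
    per_module = 2 * rank * hidden * bytes_per_param
    return layers * targets * per_module

one = adapter_bytes(layers=32, hidden=4096, rank=8)
print(one, one * 8 <= 1024**3)  # ~12 MiB each; an 8-adapter pool fits in 1 GiB
```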

Serving with trtllm-serve

YAML Configuration

Create a config.yaml file:
lora_config:
  lora_target_modules: ['attn_q', 'attn_k', 'attn_v']
  max_lora_rank: 8
  max_loras: 4
  max_cpu_loras: 8

Starting the Server

trtllm-serve meta-llama/Llama-3.1-8B-Instruct --extra_llm_api_options config.yaml

Client Usage

Send requests with LoRA adapters:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="What is the capital city of France?",
    max_tokens=20,
    extra_body={
        "lora_request": {
            "lora_name": "lora-example-0",
            "lora_int_id": 0,
            "lora_path": "/path/to/lora_adapter"
        }
    },
)

print(response.choices[0].text)

Benchmarking with trtllm-bench

YAML Configuration

lora_config:
  lora_dir:
    - /path/to/loras/0
    - /path/to/loras/1
  max_lora_rank: 64
  max_loras: 8
  max_cpu_loras: 16
  lora_target_modules:
    - attn_q
    - attn_k
    - attn_v
  trtllm_modules_to_hf_modules:
    attn_q: q_proj
    attn_k: k_proj
    attn_v: v_proj

Run Benchmark

trtllm-bench \
  --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /path/to/dataset.json \
  --extra_llm_api_options config.yaml \
  --num_requests 64 \
  --concurrency 16

Target Modules

Commonly used LoRA target modules:
lora_target_modules = ['attn_q', 'attn_k', 'attn_v']
  • Lightweight adaptation
  • Good for task-specific tuning
  • Minimal memory overhead
Module names may vary by model architecture. Use trtllm_modules_to_hf_modules to map TRT-LLM names to HuggingFace names if needed.

Performance Considerations

Rank selection:
Lower ranks (4-8):
  • Faster inference
  • Smaller adapter files
  • Sufficient for most tasks
Higher ranks (16-64):
  • Better adaptation capacity
  • Slower inference
  • Use for complex domain adaptation
Concurrent adapters:
  • max_loras controls GPU memory usage
  • More active LoRAs → less memory for KV cache
  • Start with 4-8 and tune based on workload
  • Use max_cpu_loras for larger adapter pools
Cache tuning:
  • Increase host_cache_size if frequently switching adapters
  • Increase device_cache_percent if many adapters are used concurrently
  • Monitor adapter swap times and adjust accordingly

Best Practices

1. Start with attention-only modules
   Begin with ['attn_q', 'attn_k', 'attn_v'] for most tasks. This provides good adaptation with minimal overhead.
2. Use rank 8 as a baseline
   Rank 8 offers a good balance between adaptation capacity and inference speed for most use cases.
3. Configure cache sizes appropriately
   Set max_cpu_loras to 2-4x your max_loras to allow efficient adapter swapping.
4. Combine with quantization
   Use FP8 or INT4 quantization for the base model to maximize memory savings while maintaining LoRA quality.
5. Monitor adapter usage
   Track which adapters are most frequently used and prioritize keeping them in GPU memory.

Limitations

  • LoRA adapters must have rank ≤ max_lora_rank configured at model load time
  • All adapters must target the same modules specified in lora_target_modules
  • Adapter switching has a small overhead (loading from CPU/disk to GPU)
  • Maximum number of simultaneously active adapters is limited by GPU memory
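The rank limitation is cheap to check before loading. A hypothetical pre-flight helper (illustrative only; check_adapter_ranks is not a TRT-LLM API) might look like:

```python
def check_adapter_ranks(adapter_ranks: dict, max_lora_rank: int) -> None:
    # reject adapters whose rank exceeds the value fixed at model load time
    too_big = {name: r for name, r in adapter_ranks.items() if r > max_lora_rank}
    if too_big:
        raise ValueError(f"adapters exceed max_lora_rank={max_lora_rank}: {too_big}")

check_adapter_ranks({"translation": 8, "summarization": 8}, max_lora_rank=8)  # ok
```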

Additional Resources

  • LoRA Paper: Original LoRA: Low-Rank Adaptation of Large Language Models
  • HuggingFace PEFT: Training LoRA adapters with the PEFT library
  • LoRA Adapters Hub: Browse pre-trained LoRA adapters
