Parameter-efficient fine-tuning with dynamic LoRA adapter loading, multi-LoRA support, and quantization compatibility
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts large language models to specific tasks without modifying the original model weights. Instead of updating all parameters, LoRA trains small low-rank decomposition matrices whose product is added to the frozen base weights, so the original model stays untouched and a single base model can serve many adapters.
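The low-rank update can be sketched in a few lines of NumPy. This is an illustration of the math, not TensorRT-LLM code; the dimensions and rank are arbitrary:

```python
import numpy as np

# Frozen base weight (d_out x d_in) and a rank-r LoRA update
d_out, d_in, r = 512, 512, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # frozen base weights (not trained)
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

x = rng.standard_normal(d_in)

# Effective forward pass: W @ x + B @ (A @ x); only A and B are trained,
# and with B initialized to zero the model starts identical to the base.
y = W @ x + B @ (A @ x)

# Trainable parameters: 2*d*r instead of d*d
lora_params = A.size + B.size   # 8_192
full_params = W.size            # 262_144
```

Because `B` starts at zero, the adapted model is exactly the base model before training, which is why LoRA can be attached and detached safely at inference time.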
```python
from tensorrt_llm import LLM
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.sampling_params import SamplingParams

# Configure for multiple LoRA adapters
lora_config = LoraConfig(
    lora_target_modules=['attn_q', 'attn_k', 'attn_v'],
    max_lora_rank=8,
    max_loras=4,      # Up to 4 LoRAs active in GPU memory simultaneously
    max_cpu_loras=8   # Up to 8 LoRAs cached in CPU memory
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    lora_config=lora_config
)

# Create multiple LoRA requests
lora_req1 = LoRARequest("translation", 0, "/path/to/translation_adapter")
lora_req2 = LoRARequest("summarization", 1, "/path/to/summarization_adapter")

prompts = [
    "Translate to French: Hello world",
    "Summarize: This is a long document about AI..."
]

sampling_params = SamplingParams(max_tokens=64)

# Apply different LoRAs to different prompts
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=[lora_req1, lora_req2]
)
```
LoRA adapters are applied in full precision (FP16/BF16) even when the base model is quantized. This preserves adapter quality while maintaining memory savings from quantization.
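Some back-of-the-envelope arithmetic shows why keeping adapters in FP16 costs little: an FP16 LoRA adapter is tiny next to even a 4-bit-quantized base model. The dimensions below are assumptions roughly matching an 8B Llama-style model (hidden size 4096, rank 8, 3 target modules, 32 layers), not exact TensorRT-LLM numbers:

```python
# Illustrative memory arithmetic, for intuition only
base_params = 8e9                          # ~8B-parameter base model
adapter_params = 2 * 4096 * 8 * 3 * 32     # 2*d*r weights per module,
                                           # 3 target modules, 32 layers

base_int4_gb = base_params * 0.5 / 1e9       # 4-bit weights: 0.5 bytes/param
adapter_fp16_mb = adapter_params * 2 / 1e6   # FP16 adapter: 2 bytes/param

print(f"base (INT4):    ~{base_int4_gb:.1f} GB")
print(f"adapter (FP16): ~{adapter_fp16_mb:.1f} MB")
```

Under these assumptions the FP16 adapter is on the order of tens of megabytes against several gigabytes of quantized base weights, so full-precision adapters barely dent the savings from quantization.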
Fine-tune LoRA cache sizes for optimal performance:
```python
from tensorrt_llm.llmapi.llm_args import PeftCacheConfig

# Customize cache sizes
peft_cache_config = PeftCacheConfig(
    host_cache_size=1024*1024*1024,  # 1 GiB CPU cache for LoRA weights
    device_cache_percent=0.1         # Use 10% of free GPU memory for LoRA cache
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    lora_config=lora_config,
    peft_cache_config=peft_cache_config
)
```
`host_cache_size`
Controls CPU memory allocated for caching inactive LoRA adapters. Larger values allow more adapters to be cached, reducing load time when switching between adapters.
`device_cache_percent`
Percentage of free GPU memory dedicated to the LoRA adapter cache. Higher values allow more adapters to be active simultaneously but reduce memory available for KV cache.
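A quick sizing sketch makes the trade-offs behind both parameters concrete. The adapter size below uses the same assumed dimensions as earlier (hidden size 4096, rank 8, 3 target modules, 32 layers, FP16), and the free-GPU-memory figure is a hypothetical example:

```python
# Rough sizing sketch with assumed numbers, for intuition only
adapter_bytes = 2 * 4096 * 8 * 3 * 32 * 2   # 2*d*r FP16 weights per module,
                                            # 3 modules, 32 layers, 2 bytes each

# How many such adapters fit in a 1 GiB host cache?
host_cache_size = 1024**3
adapters_in_host_cache = host_cache_size // adapter_bytes

# How device_cache_percent splits free GPU memory between the
# LoRA cache and what remains available for the KV cache
free_gpu_bytes = 40 * 1024**3               # e.g. ~40 GiB free after load
device_cache_percent = 0.1
lora_cache_bytes = int(free_gpu_bytes * device_cache_percent)
kv_cache_bytes = free_gpu_bytes - lora_cache_bytes
```

Under these assumptions a 1 GiB host cache holds dozens of adapters, while every point of `device_cache_percent` comes directly out of the memory the KV cache could otherwise use, so raise it only as far as your adapter-switching workload requires.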