Quick Start
Basic LoRA Serving
Launch a server with a single LoRA adapter by passing its path to `--lora-paths` at startup.

Multiple Adapters
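Single- and multi-adapter launches differ only in the number of `--lora-paths` entries. Sketches of each (the `python -m sglang.launch_server` entry point is assumed, and all paths are placeholders):

```shell
# Single adapter
python -m sglang.launch_server \
  --model-path /path/to/base-model \
  --lora-paths my-adapter=/path/to/adapter

# Multiple adapters, each with an explicit name
python -m sglang.launch_server \
  --model-path /path/to/base-model \
  --lora-paths adapter-a=/path/to/adapter-a adapter-b=/path/to/adapter-b
```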
Serve multiple LoRA adapters simultaneously by listing several adapters in `--lora-paths`.

Configuration Parameters
Server Arguments
| Parameter | Description | Default |
|---|---|---|
| `--enable-lora` | Enable LoRA support | Auto-enabled if `--lora-paths` provided |
| `--lora-paths` | List of LoRA adapters to load at startup | None |
| `--max-loras-per-batch` | Maximum adapters per batch | 8 |
| `--max-lora-rank` | Maximum LoRA rank to support | Auto-inferred from adapters |
| `--lora-target-modules` | Target modules for LoRA (e.g., `q_proj`, `k_proj`) | Auto-inferred or all |
| `--lora-backend` | Backend: `triton` or `csgmv` | `csgmv` |
| `--max-loaded-loras` | Maximum adapters in CPU memory | Unlimited |
| `--lora-eviction-policy` | Eviction policy: `lru` or `fifo` | `lru` |
| `--enable-lora-overlap-loading` | Overlap H2D transfers with compute | False |
| `--max-lora-chunk-size` | Chunk size for ChunkedSGMV backend | 16 |
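Several of these flags can be combined in a single launch; a sketch (paths are placeholders and the values are illustrative, not recommendations):

```shell
python -m sglang.launch_server \
  --model-path /path/to/base-model \
  --lora-paths adapter-a=/path/to/adapter-a adapter-b=/path/to/adapter-b \
  --max-loras-per-batch 4 \
  --lora-backend csgmv \
  --lora-eviction-policy lru \
  --max-loaded-loras 16
```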
LoRA Path Formats
You can specify adapters in multiple formats.

Dynamic Adapter Management
Load and unload adapters at runtime without restarting the server.

Initial Server Setup
When using dynamic loading, explicitly specify `--max-lora-rank` and `--lora-target-modules` to ensure compatibility with all adapters you plan to load.

Load Adapter
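Loading at runtime goes through the server's dynamic-loading endpoint. A sketch assuming SGLang's `/load_lora_adapter` route and the default port (the adapter name and path are placeholders):

```shell
curl -X POST http://localhost:30000/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/path/to/adapter"}'
```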
Unload Adapter
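Unloading mirrors loading; a sketch assuming SGLang's `/unload_lora_adapter` route and the default port (adapter name is a placeholder):

```shell
curl -X POST http://localhost:30000/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'
```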
OpenAI-Compatible API
Use LoRA adapters through the OpenAI-compatible API by specifying the adapter name with a colon separator (e.g., `base-model:adapter-name` in the `model` field).

Advanced Features
GPU Pinning
Pin frequently-used adapters to GPU memory to avoid repeated loading.

Backend Selection
SGLang supports two LoRA backends:

ChunkedSGMV (csgmv)
Default and recommended. Optimized for high concurrency with 20-80% latency improvements.
Triton
Basic Triton-based implementation. Use for compatibility if needed.
Overlap Loading
Overlap LoRA weight loading with GPU computation to hide data movement latency.

When to Use Overlap Loading
Enable when:
- High adapter churn (frequently switching adapters)
- Large adapter weights (high rank)
- PCIe-bottlenecked workloads
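In these cases, enabling overlap loading is a single launch flag; a sketch (entry point and paths are placeholders):

```shell
python -m sglang.launch_server \
  --model-path /path/to/base-model \
  --lora-paths adapter-a=/path/to/adapter-a adapter-b=/path/to/adapter-b \
  --enable-lora-overlap-loading
```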
Trade-offs
Pros:
- Reduces adapter load time impact
- Hides H2D transfer latency
Cons:
- Requires pinned CPU memory (limits `max-loaded-loras` to 2× `max-loras-per-batch`)
- Reduces multi-adapter prefill batching (may increase TTFT when load time << prefill time)
Implementation Architecture
SGLang's LoRA implementation consists of several key components:

LoRAManager
The `LoRAManager` class coordinates adapter lifecycle:
python/sglang/srt/lora/lora_manager.py:50
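As a purely illustrative sketch of that lifecycle (toy names and structure, not the real class), a manager that registers adapters in host memory and promotes them on demand into a bounded GPU pool with LRU eviction could look like:

```python
from collections import OrderedDict


class ToyLoRAManager:
    """Toy lifecycle sketch: a bounded 'GPU pool' with LRU eviction.

    Illustrative only; the real LoRAManager is far more involved.
    """

    def __init__(self, max_loras_per_batch: int = 8):
        self.max_slots = max_loras_per_batch  # capacity of the GPU pool
        self.cpu_adapters = {}                # name -> weights held in host memory
        self.gpu_pool = OrderedDict()         # name -> weights resident on GPU, in LRU order

    def load_adapter(self, name: str, weights) -> None:
        # Register the adapter in host memory; it is promoted to GPU on demand.
        self.cpu_adapters[name] = weights

    def unload_adapter(self, name: str) -> None:
        # Drop the adapter from both host memory and the GPU pool.
        self.cpu_adapters.pop(name, None)
        self.gpu_pool.pop(name, None)

    def prepare_batch(self, adapter_names):
        # Ensure every adapter needed by the batch is GPU-resident,
        # evicting the least-recently-used adapter when the pool is full.
        for name in adapter_names:
            if name in self.gpu_pool:
                self.gpu_pool.move_to_end(name)   # refresh LRU position
                continue
            if len(self.gpu_pool) >= self.max_slots:
                self.gpu_pool.popitem(last=False)  # evict the LRU adapter
            self.gpu_pool[name] = self.cpu_adapters[name]
        return list(self.gpu_pool)
```

Here `prepare_batch` mirrors the idea that adapters are promoted to GPU memory on demand and evicted least-recently-used first when the pool (sized by `--max-loras-per-batch` in the real server) is full.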
Memory Pool
The memory pool manages GPU memory allocation for adapter weights, implementing eviction policies (LRU or FIFO) when the pool is full.

Adapter Format
Adapters must follow the PEFT format with:

- `adapter_config.json` - Configuration (rank, target modules, alpha)
- Weight files - Adapter matrices (A and B)
- Optional `added_tokens.json` - Additional vocabulary tokens
python/sglang/srt/lora/lora_config.py:22
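A minimal `adapter_config.json` in the PEFT format might look like the following (field names follow HuggingFace PEFT conventions; the values are illustrative):

```json
{
  "peft_type": "LORA",
  "base_model_name_or_path": "/path/to/base-model",
  "r": 16,
  "lora_alpha": 32,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
}
```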
Tensor Parallelism
LoRA serving supports tensor parallelism for large models.

Performance Best Practices
1. Choose an Appropriate `max-loras-per-batch`

Set based on your concurrency needs. Higher values support more concurrent adapters but increase memory usage.

2. Pin Frequently-Used Adapters

Pin adapters that are accessed in >50% of requests to avoid repeated loading.

3. Use ChunkedSGMV Backend

The `csgmv` backend provides 20-80% better latency than `triton` at high concurrency.

4. Tune Eviction Policy

Use `lru` (default) for workloads with temporal locality. Use `fifo` for uniform access patterns.

5. Monitor Adapter Load Bottlenecks

If adapter loading is a bottleneck, enable `--enable-lora-overlap-loading`.

Limitations
Future Development
Upcoming features tracked in GitHub Issue #2929:

- Embedding layer LoRA
- Unified paging for adapters
- CUTLASS backend for improved performance
- Expanded target module support
