Runtime Configuration
This page documents runtime configuration options that control the TensorRT-LLM execution engine, scheduling policies, KV cache management, and performance optimizations.
Overview
Runtime configuration includes:
- Executor Configuration: Controls the execution engine behavior
- Scheduler Configuration: Manages request scheduling and batching
- KV Cache Configuration: Controls KV cache memory management
- Performance Knobs: Fine-tunes runtime performance characteristics
Scheduler Configuration
The SchedulerConfig class controls how requests are scheduled and batched.
Basic Configuration
Capacity Scheduler Policy
The capacity scheduler policy to use.
Options:
- MAX_UTILIZATION: Maximize GPU utilization by accepting as many requests as possible. May evict in-progress requests if resources are needed.
- GUARANTEED_NO_EVICT: Never evict in-progress requests. Only accepts new requests if they can complete without eviction. Recommended for production.
- STATIC_BATCH: Static batching mode. Batches are formed and executed atomically.
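As a minimal sketch, selecting a capacity scheduler policy looks like the following (import paths and defaults may vary between TensorRT-LLM versions):

```python
# Hedged sketch: assumes SchedulerConfig and CapacitySchedulerPolicy are
# importable from tensorrt_llm.llmapi, as in recent LLM API releases.
from tensorrt_llm.llmapi import CapacitySchedulerPolicy, SchedulerConfig

# Never evict in-progress requests -- the recommended production setting.
scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
)
```

The resulting object is typically passed to the `LLM` constructor via its `scheduler_config` argument.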
Context Chunking Policy
The context chunking policy for handling long prompts.
Note: Only applicable when enable_chunked_prefill=True.
Options:
- FIRST_COME_FIRST_SERVED: Process requests in arrival order. Earlier requests get priority for chunking.
- EQUAL_PROGRESS: Try to make equal progress across all requests. Balances fairness across requests.
Dynamic Batch Configuration
Dynamic batch configuration. Allows the runtime to automatically tune batch size and token limits.
Note: This only applies to the TensorRT backend and cannot currently be used with the PyTorch backend.
DynamicBatchConfig fields:
- enable_batch_size_tuning (bool, default: true): Controls whether batch size is tuned dynamically
- enable_max_num_tokens_tuning (bool, default: false): Controls whether max num tokens is tuned dynamically
- dynamic_batch_moving_average_window (int, default: 128): Window size for the moving average of input/output lengths
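Putting the pieces together, a scheduler configuration with dynamic batching and context chunking might look like this (a sketch for the TensorRT backend; field names follow the descriptions above and may vary by version):

```python
# Hedged sketch: assumes these classes are exported from tensorrt_llm.llmapi.
from tensorrt_llm.llmapi import (ContextChunkingPolicy, DynamicBatchConfig,
                                 SchedulerConfig)

dynamic_batch_config = DynamicBatchConfig(
    enable_batch_size_tuning=True,            # tune batch size at runtime
    enable_max_num_tokens_tuning=False,       # keep max-token limit fixed
    dynamic_batch_moving_average_window=128,  # window for length statistics
)

scheduler_config = SchedulerConfig(
    context_chunking_policy=ContextChunkingPolicy.FIRST_COME_FIRST_SERVED,
    dynamic_batch_config=dynamic_batch_config,
)
```

Remember that context chunking only takes effect when enable_chunked_prefill=True is set on the LLM.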
Waiting Queue Policy
The waiting queue scheduling policy for managing pending requests.
Options:
- FCFS: First-Come-First-Served
Extended Runtime Performance Knobs
ExtendedRuntimePerfKnobConfig provides fine-grained control over runtime performance optimizations.
Multi-Block Mode
Whether to use multi-block mode for attention computation.
Benefits:
- Improved occupancy for long sequences
- Better performance on modern GPUs
Context FMHA FP32 Accumulation
Whether to enable FP32 accumulation in context-phase Flash Multi-Head Attention (FMHA).
Benefits:
- Improved numerical accuracy for long contexts
- Reduced likelihood of overflow/underflow
Trade-offs:
- Slightly slower performance
- Higher memory usage during attention computation
CUDA Graph Mode
Whether to use CUDA graph mode for kernel execution.
Benefits:
- Reduced kernel launch overhead
- Improved performance for small batch sizes
Limitations:
- Only applies to the generation phase (not prefill)
- Requires fixed shapes for captured operations
Number of CUDA graphs to cache in the runtime.
Benefits:
- Larger cache → better performance for varying batch sizes
- Avoids re-capturing graphs for common shapes
Trade-off:
- Each graph consumes ~200 MB of GPU memory
Recommendation:
- Set to 8-16 for typical serving workloads
- Increase if you have many unique batch sizes and spare GPU memory
- Set to 0 to disable CUDA graph caching
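A sketch of these knobs combined into one configuration (TensorRT backend; exact field names are assumed to match the descriptions above):

```python
# Hedged sketch: assumes ExtendedRuntimePerfKnobConfig is importable from
# tensorrt_llm.llmapi, as in recent releases.
from tensorrt_llm.llmapi import ExtendedRuntimePerfKnobConfig

perf_knob_config = ExtendedRuntimePerfKnobConfig(
    multi_block_mode=True,               # better occupancy for long sequences
    enable_context_fmha_fp32_acc=False,  # enable only if accuracy issues appear
    cuda_graph_mode=True,                # cut kernel launch overhead
    cuda_graph_cache_size=16,            # cache graphs for common batch sizes
)
```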
PEFT Cache Configuration
PeftCacheConfig controls caching for PEFT adapters (LoRA).
Cache Sizing
Number of max-sized 1-layer, 1-module sets of weights that can be stored in the device (GPU) cache.
Calculation:
Actual GPU memory used = num_device_module_layer × max_adapter_size × model_layer_size
Recommendation:
- Start with 8-16 for typical serving
- Increase if you have many concurrent LoRA adapters
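A back-of-envelope version of that calculation, using entirely hypothetical numbers (the per-layer size depends on your model and is an assumption here):

```python
# All values below are hypothetical; substitute numbers for your deployment.
num_device_module_layer = 16    # cache slots in the device cache
max_adapter_size = 64           # maximum supported LoRA rank
model_layer_size_bytes = 8192   # bytes per rank per layer (assumption)

peft_cache_bytes = (num_device_module_layer
                    * max_adapter_size
                    * model_layer_size_bytes)
print(f"device PEFT cache ~ {peft_cache_bytes / 2**20:.1f} MiB")  # → 8.0 MiB
```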
Number of max-sized 1-layer, 1-module sets of weights that can be stored in the host (CPU) cache.
Benefits:
- Faster adapter swapping than loading from disk
- Supports many more adapters than the device cache alone
Recommendation:
- Set to 32-128 for multi-tenant serving
- Use higher values if you have spare CPU memory
Adapter Configuration
Optimal adapter size used to set the cache page width.
Purpose:
- Determines the memory page size for efficient adapter storage
- Should match the most common LoRA rank in your workload (e.g., 8 or 16)
Maximum supported adapter size (LoRA rank).
Purpose:
- Sets the upper bound on supported LoRA ranks
- Affects the minimum cache page size
Worker Configuration
Number of worker threads used to put weights into the host cache.
Recommendation: Increase to 2-4 if adapter loading becomes a bottleneck.
Number of worker threads used to copy weights from host to device.
Recommendation: Increase to 2-4 if adapter transfers become a bottleneck.
Number of CUDA streams used to copy weights from host to device.
Recommendation: Increase to 2-4 for better overlap with computation.
Memory Allocation
Number of cache pages per allocation block (host).
Number of cache pages per allocation block (device).
Proportion of free device memory (after engine load) to use for the PEFT cache (0.0 to 1.0).
Default: 0.02 (2% of free GPU memory)
Recommendation:
- Increase to 0.05-0.10 if serving many concurrent adapters
- Decrease if GPU memory is tight
Size in bytes to use for the host cache.
Default: 1073741824 (1 GB)
Recommendation:
- Increase if serving many adapters (e.g., 8 GB or more)
- Calculate based on: num_adapters × adapter_size × model_size_per_rank
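A sketch combining the sizing options above into one PeftCacheConfig (field names are assumed to match the descriptions in this section and may differ slightly between releases):

```python
# Hedged sketch: assumes PeftCacheConfig is importable from tensorrt_llm.llmapi.
from tensorrt_llm.llmapi import PeftCacheConfig

peft_cache_config = PeftCacheConfig(
    num_device_module_layer=16,     # device (GPU) cache slots
    num_host_module_layer=64,       # host (CPU) cache slots
    optimal_adapter_size=8,         # most common LoRA rank in the workload
    max_adapter_size=64,            # upper bound on supported LoRA ranks
    host_cache_size=8 * 1024**3,    # 8 GB host cache for many adapters
)
```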
Cache Transceiver Configuration
CacheTransceiverConfig controls KV cache exchange for disaggregated serving.
Communication Backend
The communication backend type to use for KV cache transfer.
Options:
- DEFAULT: Use the default backend (typically NIXL)
- NIXL: NVIDIA Inference Xfer Library (recommended for InfiniBand)
- UCX: Unified Communication X (alternative for InfiniBand/RoCE)
- MPI: MPI-based communication
- MOONCAKE: Mooncake backend (experimental)
Recommendation: Use NIXL for best performance on InfiniBand clusters.
The runtime implementation.
Options:
- CPP: C++ transceiver (default when not set; better performance)
- PYTHON: Python transceiver (easier debugging)
Recommendation: Use CPP for production.
Buffer Configuration
The maximum number of tokens the transfer buffer can fit.
Purpose:
- Controls memory allocation for KV cache transfer buffers
- Larger buffers → fewer transfers but more memory
Recommendation:
- Set to 4096-8192 for typical workloads
- Increase for very long sequences
Timeout Configuration
Timeout in milliseconds for KV cache transfer.
Purpose:
- Requests exceeding this timeout will be cancelled
- Prevents indefinite hangs on network issues
Recommendation:
- Set to 5000-10000 ms for typical networks
- Increase for high-latency networks or very large transfers
Timeout in milliseconds to wait for the sender future to be ready when the scheduled batch size is 0.
Purpose:
- Allows requests to be eventually cancelled by the user or by kv_transfer_timeout_ms
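A sketch of a cache transceiver configuration (the string-valued backend and the timeout field name are assumptions based on the descriptions above):

```python
# Hedged sketch: assumes CacheTransceiverConfig is importable from
# tensorrt_llm.llmapi and accepts these fields.
from tensorrt_llm.llmapi import CacheTransceiverConfig

cache_transceiver_config = CacheTransceiverConfig(
    backend="NIXL",                # recommended on InfiniBand clusters
    max_tokens_in_buffer=8192,     # transfer buffer capacity in tokens
    kv_transfer_timeout_ms=10000,  # cancel transfers stuck beyond 10 s
)
```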
Batching Configuration
Batching Type
The batching strategy.
Note: This is a TensorRT backend parameter.
Options:
- INFLIGHT: In-flight batching (continuous batching). Requests enter and leave the batch dynamically. Recommended for throughput.
- STATIC: Static batching. Batches are formed once and executed atomically. Recommended for latency-sensitive workloads.
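As a sketch, selecting the batching strategy looks like this (the BatchingType import path is an assumption):

```python
# Hedged sketch for the TensorRT backend.
from tensorrt_llm.llmapi import BatchingType

# In-flight (continuous) batching, recommended for throughput.
batching_type = BatchingType.INFLIGHT
```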
Example Configurations
High-Throughput Serving
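A hypothetical high-throughput setup, combining aggressive scheduling with a large KV cache budget (the model name is a placeholder, and keyword argument names are assumptions based on the options documented above):

```python
# Hedged sketch: maximize utilization with in-flight batching.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import (CapacitySchedulerPolicy, KvCacheConfig,
                                 SchedulerConfig)

llm = LLM(
    model="path/to/model",  # placeholder
    scheduler_config=SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION,
    ),
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.9),
)
```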
Low-Latency Serving
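A hypothetical low-latency setup: no eviction, small batches, and CUDA graphs to trim launch overhead (placeholder model path; argument names are assumptions):

```python
# Hedged sketch for latency-sensitive serving on the TensorRT backend.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import (CapacitySchedulerPolicy,
                                 ExtendedRuntimePerfKnobConfig,
                                 SchedulerConfig)

llm = LLM(
    model="path/to/model",  # placeholder
    scheduler_config=SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
    ),
    extended_runtime_perf_knob_config=ExtendedRuntimePerfKnobConfig(
        cuda_graph_mode=True,      # reduce launch overhead at small batch sizes
        cuda_graph_cache_size=16,  # cache graphs for common shapes
    ),
)
```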
Multi-LoRA Serving
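A hypothetical multi-LoRA setup, sized for many concurrent adapters (the LoRA-related keyword arguments and model path are assumptions):

```python
# Hedged sketch: LoRA serving with a generous host-side adapter cache.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import PeftCacheConfig

llm = LLM(
    model="path/to/base-model",  # placeholder
    enable_lora=True,
    max_lora_rank=64,
    peft_cache_config=PeftCacheConfig(
        num_device_module_layer=16,   # GPU cache slots
        host_cache_size=8 * 1024**3,  # 8 GB host cache for adapter swapping
    ),
)
```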
Disaggregated Serving
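A hypothetical worker configuration for disaggregated serving; the wiring of context and generation roles happens in the serving layer and is not shown (placeholder model path; the string backend value is an assumption):

```python
# Hedged sketch: enable KV cache transfer for disaggregated serving.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CacheTransceiverConfig

llm = LLM(
    model="path/to/model",  # placeholder
    cache_transceiver_config=CacheTransceiverConfig(
        backend="NIXL",             # recommended on InfiniBand clusters
        max_tokens_in_buffer=8192,  # transfer buffer capacity in tokens
    ),
)
```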
See Also
- LLM Arguments Configuration - Complete LlmArgs reference
- KV Cache Configuration - KV cache management guide
- Disaggregated Serving - Disaggregated serving guide
- LoRA Support - LoRA adapter guide