Runtime Configuration

This page documents runtime configuration options that control the TensorRT-LLM execution engine, scheduling policies, KV cache management, and performance optimizations.

Overview

Runtime configuration includes:
  • Executor Configuration: Controls the execution engine behavior
  • Scheduler Configuration: Manages request scheduling and batching
  • KV Cache Configuration: Controls KV cache memory management
  • Performance Knobs: Fine-tunes runtime performance characteristics

Scheduler Configuration

The SchedulerConfig class controls how requests are scheduled and batched.

Basic Configuration

from tensorrt_llm.llmapi import (
    SchedulerConfig,
    CapacitySchedulerPolicy,
    ContextChunkingPolicy,
)

scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
    context_chunking_policy=ContextChunkingPolicy.FIRST_COME_FIRST_SERVED,
)

Capacity Scheduler Policy

capacity_scheduler_policy
CapacitySchedulerPolicy
default:"GUARANTEED_NO_EVICT"
The capacity scheduler policy to use.
Options:
  • MAX_UTILIZATION: Maximize GPU utilization by accepting as many requests as possible. May evict in-progress requests if resources are needed.
  • GUARANTEED_NO_EVICT: Never evict in-progress requests. Only accepts new requests if they can complete without eviction. Recommended for production.
  • STATIC_BATCH: Static batching mode. Batches are formed and executed atomically.
Example:
scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT
)

Context Chunking Policy

context_chunking_policy
Optional[ContextChunkingPolicy]
The context chunking policy for handling long prompts.
Options:
  • FIRST_COME_FIRST_SERVED: Process requests in arrival order. Earlier requests get priority for chunking.
  • EQUAL_PROGRESS: Try to make equal progress across all requests. Balances fairness across requests.
Example:
scheduler_config = SchedulerConfig(
    context_chunking_policy=ContextChunkingPolicy.EQUAL_PROGRESS
)
Note: Only applicable when enable_chunked_prefill=True.
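A sketch pairing the chunking policy with chunked prefill, assuming enable_chunked_prefill is a top-level LlmArgs flag as the note implies:

```python
from tensorrt_llm.llmapi import (
    LlmArgs,
    SchedulerConfig,
    ContextChunkingPolicy,
)

# enable_chunked_prefill must be on for the chunking policy to take effect
args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_chunked_prefill=True,
    scheduler_config=SchedulerConfig(
        context_chunking_policy=ContextChunkingPolicy.EQUAL_PROGRESS
    ),
)
```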

Dynamic Batch Configuration

dynamic_batch_config
Optional[DynamicBatchConfig]
Dynamic batch configuration. Allows the runtime to automatically tune batch size and token limits.
DynamicBatchConfig fields:
  • enable_batch_size_tuning (bool, default: true): Controls if batch size should be tuned dynamically
  • enable_max_num_tokens_tuning (bool, default: false): Controls if max num tokens should be tuned dynamically
  • dynamic_batch_moving_average_window (int, default: 128): Window size for moving average of input/output lengths
Example:
from tensorrt_llm.llmapi import DynamicBatchConfig

scheduler_config = SchedulerConfig(
    dynamic_batch_config=DynamicBatchConfig(
        enable_batch_size_tuning=True,
        enable_max_num_tokens_tuning=True,
        dynamic_batch_moving_average_window=256
    )
)
Note: This only applies to the TensorRT backend and cannot currently be used with the PyTorch backend.

Waiting Queue Policy

waiting_queue_policy
WaitingQueuePolicy
default:"FCFS"
The waiting queue scheduling policy for managing pending requests.
Options:
  • FCFS: First-Come-First-Served
Example:
scheduler_config = SchedulerConfig(
    waiting_queue_policy="FCFS"
)

Extended Runtime Performance Knobs

ExtendedRuntimePerfKnobConfig provides fine-grained control over runtime performance optimizations.
from tensorrt_llm.llmapi import ExtendedRuntimePerfKnobConfig

perf_config = ExtendedRuntimePerfKnobConfig(
    multi_block_mode=True,
    enable_context_fmha_fp32_acc=False,
    cuda_graph_mode=True,
    cuda_graph_cache_size=16
)

Multi-Block Mode

multi_block_mode
bool
default:true
Whether to use multi-block mode for attention computation.
Benefits:
  • Improved occupancy for long sequences
  • Better performance on modern GPUs
Recommendation: Keep enabled unless debugging.

Context FMHA FP32 Accumulation

enable_context_fmha_fp32_acc
bool
default:false
Whether to enable FP32 accumulation in the context-phase FMHA (fused multi-head attention) kernels.
Benefits:
  • Improved numerical accuracy for long context
  • Reduced likelihood of overflow/underflow
Trade-offs:
  • Slightly slower performance
  • Higher memory usage during attention computation
Recommendation: Enable for models with very long context or if you observe numerical issues.
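To opt into the higher-accuracy path in isolation, the knob can be set on its own; a minimal fragment using only the field documented above, with other fields left at their defaults:

```python
from tensorrt_llm.llmapi import ExtendedRuntimePerfKnobConfig

# Trade a little attention-kernel speed for accuracy on long-context workloads
perf_config = ExtendedRuntimePerfKnobConfig(
    enable_context_fmha_fp32_acc=True,
)
```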

CUDA Graph Mode

cuda_graph_mode
bool
default:false
Whether to use CUDA graph mode for kernel execution.
Benefits:
  • Reduced kernel launch overhead
  • Improved performance for small batch sizes
Limitations:
  • Only applies to generation phase (not prefill)
  • Requires fixed shapes for captured operations
Recommendation: Enable for latency-sensitive applications.
cuda_graph_cache_size
int
default:0
Number of CUDA graphs to cache in the runtime.
Benefits:
  • Larger cache → better performance for varying batch sizes
  • Avoids re-capturing graphs for common shapes
Trade-offs:
  • Each graph consumes ~200 MB of GPU memory
Recommendation:
  • Set to 8-16 for typical serving workloads
  • Increase if you have many unique batch sizes and spare GPU memory
  • Set to 0 to disable CUDA graph caching
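The memory trade-off above can be budgeted up front. A small sketch using the ~200 MB-per-graph figure cited above; treat the per-graph size as a rough estimate, since actual usage depends on the model and captured batch shapes:

```python
def cuda_graph_cache_memory_mb(cache_size: int, mb_per_graph: int = 200) -> int:
    """Approximate GPU memory consumed by the CUDA graph cache, in MB."""
    return cache_size * mb_per_graph

# The recommended 8-16 range translates to roughly 1.6-3.2 GB:
print(cuda_graph_cache_memory_mb(8))   # 1600
print(cuda_graph_cache_memory_mb(16))  # 3200
```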

PEFT Cache Configuration

PeftCacheConfig controls caching for PEFT adapters (LoRA).
from tensorrt_llm.llmapi import PeftCacheConfig

peft_config = PeftCacheConfig(
    num_device_module_layer=8,
    num_host_module_layer=64,
    optimal_adapter_size=16,
    max_adapter_size=128,
    device_cache_percent=0.05
)

Cache Sizing

num_device_module_layer
int
default:0
Number of max-sized 1-layer, 1-module sets of weights that can be stored in the device (GPU) cache.
Calculation: actual GPU memory used = num_device_module_layer × max_adapter_size × model_layer_size
Recommendation:
  • Start with 8-16 for typical serving
  • Increase if you have many concurrent LoRA adapters
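The calculation above can be sketched directly. Here model_layer_size_bytes is a hypothetical placeholder for the per-layer, per-module weight footprint of your model; measure it for real sizing:

```python
def peft_device_cache_bytes(num_device_module_layer: int,
                            max_adapter_size: int,
                            model_layer_size_bytes: int) -> int:
    """GPU memory = num_device_module_layer x max_adapter_size x model_layer_size."""
    return num_device_module_layer * max_adapter_size * model_layer_size_bytes

# e.g. 16 cache slots, rank 64, 32 KiB per layer-module -> 32 MiB
print(peft_device_cache_bytes(16, 64, 32 * 1024))  # 33554432
```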
num_host_module_layer
int
default:0
Number of max-sized 1-layer, 1-module sets of weights that can be stored in the host (CPU) cache.
Benefits:
  • Faster adapter swapping than loading from disk
  • Supports many more adapters than device cache alone
Recommendation:
  • Set to 32-128 for multi-tenant serving
  • Higher values if you have spare CPU memory

Adapter Configuration

optimal_adapter_size
int
default:8
Optimal adapter size, used to set the cache page width.
Purpose:
  • Determines memory page size for efficient adapter storage
  • Should match the most common LoRA rank in your workload
Recommendation: Set to the median LoRA rank you’ll be serving (typically 8 or 16).
max_adapter_size
int
default:64
Maximum supported adapter size (LoRA rank).
Purpose:
  • Sets upper bound on LoRA ranks
  • Affects minimum cache page size
Recommendation: Set to the maximum LoRA rank you need to support.

Worker Configuration

num_put_workers
int
default:1
Number of worker threads used to put weights into the host cache.
Recommendation: Increase to 2-4 if adapter loading becomes a bottleneck.
num_ensure_workers
int
default:1
Number of worker threads used to copy weights from host to device.
Recommendation: Increase to 2-4 if adapter transfers become a bottleneck.
num_copy_streams
int
default:1
Number of CUDA streams used to copy weights from host to device.
Recommendation: Increase to 2-4 for better overlap with computation.
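If adapter loads or host-to-device transfers show up as a bottleneck, the three worker counts can be raised together. A fragment using only the fields documented above:

```python
from tensorrt_llm.llmapi import PeftCacheConfig

peft_config = PeftCacheConfig(
    num_put_workers=2,      # parallel loads into the host cache
    num_ensure_workers=2,   # parallel host-to-device copies
    num_copy_streams=2,     # CUDA streams to overlap copies with compute
)
```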

Memory Allocation

max_pages_per_block_host
int
default:24
Number of cache pages per allocation block (host).
max_pages_per_block_device
int
default:8
Number of cache pages per allocation block (device).
device_cache_percent
float
Proportion of free device memory (measured after engine load) to use for the PEFT cache, from 0.0 to 1.0.
Default: 0.02 (2% of free GPU memory)
Recommendation:
  • Increase to 0.05-0.10 if serving many concurrent adapters
  • Decrease if GPU memory is tight
host_cache_size
int
default:1073741824
Size in bytes to use for the host cache.
Default: 1073741824 (1 GiB)
Recommendation:
  • Increase if serving many adapters (e.g., 8 GB or more)
  • Calculate based on: num_adapters × adapter_size × model_size_per_rank
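The sizing formula above is simple arithmetic. In this sketch, model_size_per_rank_bytes is a hypothetical placeholder for the per-rank adapter weight footprint of your model, which you should measure for real deployments:

```python
def peft_host_cache_bytes(num_adapters: int,
                          adapter_size: int,
                          model_size_per_rank_bytes: int) -> int:
    """host_cache_size = num_adapters x adapter_size x model_size_per_rank."""
    return num_adapters * adapter_size * model_size_per_rank_bytes

# 128 adapters at rank 16, 1 MiB per rank -> 2 GiB
print(peft_host_cache_bytes(128, 16, 1024 * 1024))  # 2147483648
```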

Cache Transceiver Configuration

CacheTransceiverConfig controls KV cache exchange for disaggregated serving.
from tensorrt_llm.llmapi import CacheTransceiverConfig

transceiver_config = CacheTransceiverConfig(
    backend="NIXL",
    max_tokens_in_buffer=8192,
    kv_transfer_timeout_ms=10000
)

Communication Backend

backend
Optional[Literal['DEFAULT', 'UCX', 'NIXL', 'MOONCAKE', 'MPI']]
The communication backend type to use for KV cache transfer.
Options:
  • DEFAULT: Use default backend (typically NIXL)
  • NIXL: NVIDIA Inference Xfer Library (recommended)
  • UCX: Unified Communication X (alternative for InfiniBand/RoCE)
  • MPI: MPI-based communication
  • MOONCAKE: Mooncake backend (experimental)
Recommendation: Use NIXL for best performance on InfiniBand clusters.
transceiver_runtime
Optional[Literal['CPP', 'PYTHON']]
The runtime implementation.
Options:
  • CPP: C++ transceiver (default when not set, better performance)
  • PYTHON: Python transceiver (easier debugging)
Recommendation: Use CPP for production.
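For production, the C++ transceiver can be selected explicitly alongside the backend; a fragment assuming transceiver_runtime is accepted as a string literal, as the field description above suggests:

```python
from tensorrt_llm.llmapi import CacheTransceiverConfig

transceiver_config = CacheTransceiverConfig(
    backend="NIXL",
    transceiver_runtime="CPP",  # C++ transceiver: better performance
)
```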

Buffer Configuration

max_tokens_in_buffer
Optional[int]
The maximum number of tokens the transfer buffer can hold.
Purpose:
  • Controls memory allocation for KV cache transfer buffers
  • Larger buffers → fewer transfers but more memory
Recommendation:
  • Set to 4096-8192 for typical workloads
  • Increase for very long sequences

Timeout Configuration

kv_transfer_timeout_ms
Optional[int]
Timeout in milliseconds for a KV cache transfer.
Purpose:
  • Requests exceeding this timeout will be cancelled
  • Prevents indefinite hangs on network issues
Recommendation:
  • Set to 5000-10000 ms for typical networks
  • Increase for high-latency networks or very large transfers
kv_transfer_sender_future_timeout_ms
int
default:1000
Timeout in milliseconds to wait for the sender future to become ready when the scheduled batch size is 0.
Purpose:
  • Allows requests to be eventually cancelled by the user or kv_transfer_timeout_ms
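A sketch combining the two timeouts, using only fields documented on this page:

```python
from tensorrt_llm.llmapi import CacheTransceiverConfig

transceiver_config = CacheTransceiverConfig(
    kv_transfer_timeout_ms=10000,              # cancel transfers after 10 s
    kv_transfer_sender_future_timeout_ms=1000,  # re-check sender readiness every 1 s
)
```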

Batching Configuration

Batching Type

batching_type
Optional[BatchingType]
The batching strategy.
Options:
  • INFLIGHT: In-flight batching (continuous batching). Requests enter and leave the batch dynamically. Recommended for throughput.
  • STATIC: Static batching. Batches are formed once and executed atomically. Recommended for latency-sensitive workloads.
Example:
from tensorrt_llm.llmapi import BatchingType, LlmArgs

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    batching_type=BatchingType.INFLIGHT
)
Note: This is a TensorRT backend parameter.

Example Configurations

High-Throughput Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    SchedulerConfig,
    CapacitySchedulerPolicy,
    BatchingType,
)

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    max_batch_size=256,
    max_num_tokens=8192,
    batching_type=BatchingType.INFLIGHT,
    scheduler_config=SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION
    )
)

Low-Latency Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    ExtendedRuntimePerfKnobConfig,
    BatchingType,
)

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    max_batch_size=32,
    batching_type=BatchingType.STATIC,
    extended_runtime_perf_knob_config=ExtendedRuntimePerfKnobConfig(
        multi_block_mode=True,
        cuda_graph_mode=True,
        cuda_graph_cache_size=16
    )
)

Multi-LoRA Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    PeftCacheConfig,
    LoraConfig,
)

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    lora_config=LoraConfig(
        max_lora_rank=64,
        lora_dir=["/path/to/adapter1", "/path/to/adapter2"]
    ),
    peft_cache_config=PeftCacheConfig(
        num_device_module_layer=16,
        num_host_module_layer=128,
        optimal_adapter_size=16,
        max_adapter_size=64,
        device_cache_percent=0.05
    )
)

Disaggregated Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    CacheTransceiverConfig,
)

args = LlmArgs(
    model="meta-llama/Llama-2-70b-hf",
    cache_transceiver_config=CacheTransceiverConfig(
        backend="NIXL",
        max_tokens_in_buffer=8192,
        kv_transfer_timeout_ms=10000
    )
)
