Runtime Configuration

This page documents runtime configuration options that control the TensorRT-LLM execution engine, scheduling policies, KV cache management, and performance optimizations.

Overview

Runtime configuration includes:
  • Executor Configuration: Controls the execution engine behavior
  • Scheduler Configuration: Manages request scheduling and batching
  • KV Cache Configuration: Controls KV cache memory management
  • Performance Knobs: Fine-tunes runtime performance characteristics

Scheduler Configuration

The SchedulerConfig class controls how requests are scheduled and batched.

Basic Configuration

from tensorrt_llm.llmapi import (
    SchedulerConfig,
    CapacitySchedulerPolicy,
    ContextChunkingPolicy,
)

scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
    context_chunking_policy=ContextChunkingPolicy.FIRST_COME_FIRST_SERVED,
)

Capacity Scheduler Policy

capacity_scheduler_policy
CapacitySchedulerPolicy
default:"GUARANTEED_NO_EVICT"
The capacity scheduler policy to use.
Options:
  • MAX_UTILIZATION: Maximize GPU utilization by accepting as many requests as possible. May evict in-progress requests if resources are needed.
  • GUARANTEED_NO_EVICT: Never evict in-progress requests. Only accepts new requests if they can complete without eviction. Recommended for production.
  • STATIC_BATCH: Static batching mode. Batches are formed and executed atomically.
Example:
scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT
)

Context Chunking Policy

context_chunking_policy
Optional[ContextChunkingPolicy]
The context chunking policy for handling long prompts.
Options:
  • FIRST_COME_FIRST_SERVED: Process requests in arrival order. Earlier requests get priority for chunking.
  • EQUAL_PROGRESS: Try to make equal progress across all requests. Balances fairness across requests.
Example:
scheduler_config = SchedulerConfig(
    context_chunking_policy=ContextChunkingPolicy.EQUAL_PROGRESS
)
Note: Only applicable when enable_chunked_prefill=True.
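A sketch pairing the chunking policy with chunked prefill, assuming enable_chunked_prefill is a top-level LlmArgs flag as the note implies:

```python
from tensorrt_llm.llmapi import (
    LlmArgs,
    SchedulerConfig,
    ContextChunkingPolicy,
)

# enable_chunked_prefill must be on for the chunking policy to take effect
args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_chunked_prefill=True,
    scheduler_config=SchedulerConfig(
        context_chunking_policy=ContextChunkingPolicy.EQUAL_PROGRESS
    ),
)
```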

Dynamic Batch Configuration

dynamic_batch_config
Optional[DynamicBatchConfig]
Dynamic batch configuration. Allows the runtime to automatically tune batch size and token limits.
DynamicBatchConfig fields:
  • enable_batch_size_tuning (bool, default: true): Controls if batch size should be tuned dynamically
  • enable_max_num_tokens_tuning (bool, default: false): Controls if max num tokens should be tuned dynamically
  • dynamic_batch_moving_average_window (int, default: 128): Window size for moving average of input/output lengths
Example:
from tensorrt_llm.llmapi import DynamicBatchConfig

scheduler_config = SchedulerConfig(
    dynamic_batch_config=DynamicBatchConfig(
        enable_batch_size_tuning=True,
        enable_max_num_tokens_tuning=True,
        dynamic_batch_moving_average_window=256
    )
)
Note: This only applies to the TensorRT backend and cannot currently be used with the PyTorch backend.

Waiting Queue Policy

waiting_queue_policy
WaitingQueuePolicy
default:"FCFS"
The waiting queue scheduling policy for managing pending requests.
Options:
  • FCFS: First-Come-First-Served
Example:
scheduler_config = SchedulerConfig(
    waiting_queue_policy="FCFS"
)

Extended Runtime Performance Knobs

ExtendedRuntimePerfKnobConfig provides fine-grained control over runtime performance optimizations.
from tensorrt_llm.llmapi import ExtendedRuntimePerfKnobConfig

perf_config = ExtendedRuntimePerfKnobConfig(
    multi_block_mode=True,
    enable_context_fmha_fp32_acc=False,
    cuda_graph_mode=True,
    cuda_graph_cache_size=16
)

Multi-Block Mode

multi_block_mode
bool
default:true
Whether to use multi-block mode for attention computation.
Benefits:
  • Improved occupancy for long sequences
  • Better performance on modern GPUs
Recommendation: Keep enabled unless debugging.

Context FMHA FP32 Accumulation

enable_context_fmha_fp32_acc
bool
default:false
Whether to enable FP32 accumulation in the context-phase FMHA (fused multi-head attention) kernels.
Benefits:
  • Improved numerical accuracy for long context
  • Reduced likelihood of overflow/underflow
Trade-offs:
  • Slightly slower performance
  • Higher memory usage during attention computation
Recommendation: Enable for models with very long context or if you observe numerical issues.
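To opt into the higher-accuracy path in isolation, the knob can be set on its own; a minimal fragment using only the field documented above, with other fields left at their defaults:

```python
from tensorrt_llm.llmapi import ExtendedRuntimePerfKnobConfig

# Trade a little attention-kernel speed for accuracy on long-context workloads
perf_config = ExtendedRuntimePerfKnobConfig(
    enable_context_fmha_fp32_acc=True,
)
```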

CUDA Graph Mode

cuda_graph_mode
bool
default:false
Whether to use CUDA graph mode for kernel execution.
Benefits:
  • Reduced kernel launch overhead
  • Improved performance for small batch sizes
Limitations:
  • Only applies to generation phase (not prefill)
  • Requires fixed shapes for captured operations
Recommendation: Enable for latency-sensitive applications.
cuda_graph_cache_size
int
default:0
Number of CUDA graphs to cache in the runtime.
Benefits:
  • Larger cache → better performance for varying batch sizes
  • Avoids re-capturing graphs for common shapes
Trade-offs:
  • Each graph consumes ~200 MB of GPU memory
Recommendation:
  • Set to 8-16 for typical serving workloads
  • Increase if you have many unique batch sizes and spare GPU memory
  • Set to 0 to disable CUDA graph caching
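The memory trade-off above can be budgeted up front. A small sketch using the ~200 MB-per-graph figure cited above; treat the per-graph size as a rough estimate, since actual usage depends on the model and captured batch shapes:

```python
def cuda_graph_cache_memory_mb(cache_size: int, mb_per_graph: int = 200) -> int:
    """Approximate GPU memory consumed by the CUDA graph cache, in MB."""
    return cache_size * mb_per_graph

# The recommended 8-16 range translates to roughly 1.6-3.2 GB:
print(cuda_graph_cache_memory_mb(8))   # 1600
print(cuda_graph_cache_memory_mb(16))  # 3200
```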

PEFT Cache Configuration

PeftCacheConfig controls caching for PEFT adapters (LoRA).
from tensorrt_llm.llmapi import PeftCacheConfig

peft_config = PeftCacheConfig(
    num_device_module_layer=8,
    num_host_module_layer=64,
    optimal_adapter_size=16,
    max_adapter_size=128,
    device_cache_percent=0.05
)

Cache Sizing

num_device_module_layer
int
default:0
Number of max-sized 1-layer, 1-module sets of weights that can be stored in the device (GPU) cache.
Calculation: actual GPU memory used = num_device_module_layer × max_adapter_size × model_layer_size
Recommendation:
  • Start with 8-16 for typical serving
  • Increase if you have many concurrent LoRA adapters
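The calculation above can be sketched directly. Here model_layer_size_bytes is a hypothetical placeholder for the per-layer, per-module weight footprint of your model; measure it for real sizing:

```python
def peft_device_cache_bytes(num_device_module_layer: int,
                            max_adapter_size: int,
                            model_layer_size_bytes: int) -> int:
    """GPU memory = num_device_module_layer x max_adapter_size x model_layer_size."""
    return num_device_module_layer * max_adapter_size * model_layer_size_bytes

# e.g. 16 cache slots, rank 64, 32 KiB per layer-module -> 32 MiB
print(peft_device_cache_bytes(16, 64, 32 * 1024))  # 33554432
```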
num_host_module_layer
int
default:0
Number of max-sized 1-layer, 1-module sets of weights that can be stored in the host (CPU) cache.
Benefits:
  • Faster adapter swapping than loading from disk
  • Supports many more adapters than device cache alone
Recommendation:
  • Set to 32-128 for multi-tenant serving
  • Higher values if you have spare CPU memory

Adapter Configuration

optimal_adapter_size
int
default:8
Optimal adapter size, used to set the cache page width.
Purpose:
  • Determines memory page size for efficient adapter storage
  • Should match the most common LoRA rank in your workload
Recommendation: Set to the median LoRA rank you’ll be serving (typically 8 or 16).
max_adapter_size
int
default:64
Maximum supported adapter size (LoRA rank).
Purpose:
  • Sets upper bound on LoRA ranks
  • Affects minimum cache page size
Recommendation: Set to the maximum LoRA rank you need to support.

Worker Configuration

num_put_workers
int
default:1
Number of worker threads used to put weights into the host cache.
Recommendation: Increase to 2-4 if adapter loading becomes a bottleneck.
num_ensure_workers
int
default:1
Number of worker threads used to copy weights from host to device.
Recommendation: Increase to 2-4 if adapter transfers become a bottleneck.
num_copy_streams
int
default:1
Number of CUDA streams used to copy weights from host to device.
Recommendation: Increase to 2-4 for better overlap with computation.
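If adapter loads or host-to-device transfers show up as a bottleneck, the three worker counts can be raised together. A fragment using only the fields documented above:

```python
from tensorrt_llm.llmapi import PeftCacheConfig

peft_config = PeftCacheConfig(
    num_put_workers=2,      # parallel loads into the host cache
    num_ensure_workers=2,   # parallel host-to-device copies
    num_copy_streams=2,     # CUDA streams to overlap copies with compute
)
```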

Memory Allocation

max_pages_per_block_host
int
default:24
Number of cache pages per allocation block (host).
max_pages_per_block_device
int
default:8
Number of cache pages per allocation block (device).
device_cache_percent
float
Proportion of free device memory (measured after engine load) to use for the PEFT cache, from 0.0 to 1.0.
Default: 0.02 (2% of free GPU memory)
Recommendation:
  • Increase to 0.05-0.10 if serving many concurrent adapters
  • Decrease if GPU memory is tight
host_cache_size
int
default:1073741824
Size in bytes to use for the host cache.
Default: 1073741824 (1 GiB)
Recommendation:
  • Increase if serving many adapters (e.g., 8 GB or more)
  • Calculate based on: num_adapters × adapter_size × model_size_per_rank
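The sizing formula above is simple arithmetic. In this sketch, model_size_per_rank_bytes is a hypothetical placeholder for the per-rank adapter weight footprint of your model, which you should measure for real deployments:

```python
def peft_host_cache_bytes(num_adapters: int,
                          adapter_size: int,
                          model_size_per_rank_bytes: int) -> int:
    """host_cache_size = num_adapters x adapter_size x model_size_per_rank."""
    return num_adapters * adapter_size * model_size_per_rank_bytes

# 128 adapters at rank 16, 1 MiB per rank -> 2 GiB
print(peft_host_cache_bytes(128, 16, 1024 * 1024))  # 2147483648
```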

Cache Transceiver Configuration

CacheTransceiverConfig controls KV cache exchange for disaggregated serving.
from tensorrt_llm.llmapi import CacheTransceiverConfig

transceiver_config = CacheTransceiverConfig(
    backend="NIXL",
    max_tokens_in_buffer=8192,
    kv_transfer_timeout_ms=10000
)

Communication Backend

backend
Optional[Literal['DEFAULT', 'UCX', 'NIXL', 'MOONCAKE', 'MPI']]
The communication backend type to use for KV cache transfer.
Options:
  • DEFAULT: Use default backend (typically NIXL)
  • NIXL: NVIDIA Inference Xfer Library (recommended)
  • UCX: Unified Communication X (alternative for InfiniBand/RoCE)
  • MPI: MPI-based communication
  • MOONCAKE: Mooncake backend (experimental)
Recommendation: Use NIXL for best performance on InfiniBand clusters.
transceiver_runtime
Optional[Literal['CPP', 'PYTHON']]
The runtime implementation.
Options:
  • CPP: C++ transceiver (default when not set, better performance)
  • PYTHON: Python transceiver (easier debugging)
Recommendation: Use CPP for production.
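For production, the C++ transceiver can be selected explicitly alongside the backend; a fragment assuming transceiver_runtime is accepted as a string literal, as the field description above suggests:

```python
from tensorrt_llm.llmapi import CacheTransceiverConfig

transceiver_config = CacheTransceiverConfig(
    backend="NIXL",
    transceiver_runtime="CPP",  # C++ transceiver: better performance
)
```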

Buffer Configuration

max_tokens_in_buffer
Optional[int]
The maximum number of tokens the transfer buffer can hold.
Purpose:
  • Controls memory allocation for KV cache transfer buffers
  • Larger buffers → fewer transfers but more memory
Recommendation:
  • Set to 4096-8192 for typical workloads
  • Increase for very long sequences

Timeout Configuration

kv_transfer_timeout_ms
Optional[int]
Timeout in milliseconds for a KV cache transfer.
Purpose:
  • Requests exceeding this timeout will be cancelled
  • Prevents indefinite hangs on network issues
Recommendation:
  • Set to 5000-10000 ms for typical networks
  • Increase for high-latency networks or very large transfers
kv_transfer_sender_future_timeout_ms
int
default:1000
Timeout in milliseconds to wait for the sender future to become ready when the scheduled batch size is 0.
Purpose:
  • Allows requests to be eventually cancelled by the user or kv_transfer_timeout_ms
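A sketch combining the two timeouts, using only fields documented on this page:

```python
from tensorrt_llm.llmapi import CacheTransceiverConfig

transceiver_config = CacheTransceiverConfig(
    kv_transfer_timeout_ms=10000,              # cancel transfers after 10 s
    kv_transfer_sender_future_timeout_ms=1000,  # re-check sender readiness every 1 s
)
```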

Batching Configuration

Batching Type

batching_type
Optional[BatchingType]
The batching strategy.
Options:
  • INFLIGHT: In-flight batching (continuous batching). Requests enter and leave the batch dynamically. Recommended for throughput.
  • STATIC: Static batching. Batches are formed once and executed atomically. Recommended for latency-sensitive workloads.
Example:
from tensorrt_llm.llmapi import BatchingType, LlmArgs

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    batching_type=BatchingType.INFLIGHT
)
Note: This is a TensorRT backend parameter.

Example Configurations

High-Throughput Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    SchedulerConfig,
    CapacitySchedulerPolicy,
    BatchingType,
)

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    max_batch_size=256,
    max_num_tokens=8192,
    batching_type=BatchingType.INFLIGHT,
    scheduler_config=SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION
    )
)

Low-Latency Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    ExtendedRuntimePerfKnobConfig,
    BatchingType,
)

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    max_batch_size=32,
    batching_type=BatchingType.STATIC,
    extended_runtime_perf_knob_config=ExtendedRuntimePerfKnobConfig(
        multi_block_mode=True,
        cuda_graph_mode=True,
        cuda_graph_cache_size=16
    )
)

Multi-LoRA Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    PeftCacheConfig,
    LoraConfig,
)

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    lora_config=LoraConfig(
        max_lora_rank=64,
        lora_dir=["/path/to/adapter1", "/path/to/adapter2"]
    ),
    peft_cache_config=PeftCacheConfig(
        num_device_module_layer=16,
        num_host_module_layer=128,
        optimal_adapter_size=16,
        max_adapter_size=64,
        device_cache_percent=0.05
    )
)

Disaggregated Serving

from tensorrt_llm.llmapi import (
    LlmArgs,
    CacheTransceiverConfig,
)

args = LlmArgs(
    model="meta-llama/Llama-2-70b-hf",
    cache_transceiver_config=CacheTransceiverConfig(
        backend="NIXL",
        max_tokens_in_buffer=8192,
        kv_transfer_timeout_ms=10000
    )
)
