TensorRT-LLM is designed with a layered architecture that combines a Python API frontend with highly optimized execution backends. This design enables both ease of use and maximum performance for LLM inference on NVIDIA GPUs.

High-Level Architecture

The system follows a three-layer architecture: the LLM API frontend, the executor layer, and a shared C++ core.

The PyTorch backend is the default and recommended backend for most use cases. It provides the best balance of performance and flexibility.

Core Components

TensorRT-LLM’s architecture is built around several key components that work together to deliver high-performance inference:

LLM API (Entry Point)

The LLM class in tensorrt_llm/llmapi/llm.py serves as the main entry point for users:
from tensorrt_llm import LLM

# Initialize with any HuggingFace model
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Generate text
output = llm.generate("Hello, my name is")
The LLM API automatically handles:
  • Tokenization: Converting input text to token IDs
  • Detokenization: Converting output token IDs back to text
  • Backend selection: Choosing the appropriate execution backend
  • Model loading: Loading model weights and configuration

Executor Layer

The executor layer is responsible for managing the execution of inference requests. Different backends use different executors:
The PyExecutor creates a dedicated worker process on each GPU rank and operates in a continuous background loop to process inference requests asynchronously.

Location: tensorrt_llm/_torch/pyexecutor/py_executor.py

Key responsibilities:
  • Fetches new inference requests from the request queue
  • Schedules requests for execution
  • Manages model forward passes
  • Coordinates with the decoder for token generation
The TensorRT backend uses compiled TensorRT engines for maximum performance. This is the legacy backend, maintained for backward compatibility.

Entry point: LLM(backend="tensorrt")
Path: builder.py → trtllm.Executor → TensorRT Engine
AutoDeploy automatically converts PyTorch/HuggingFace models to optimized TensorRT-LLM inference graphs through automated graph transformations.

Entry point: LLM(backend="_autodeploy")
Path: _torch/auto_deploy/
ADExecutor → graph transforms + torch.export

Shared C++ Core

All backends share highly optimized C++ components (exposed via Nanobind bindings) for critical runtime operations:

Scheduling Pipeline

  • Scheduler: Determines which requests can be executed
  • Batch Manager: Implements in-flight batching (continuous batching)
  • KV Cache Manager: Allocates and manages key-value cache blocks
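The interplay of these three components can be sketched in plain Python (an illustrative toy, not the real C++ scheduler): each step, in-flight requests keep running, and waiting requests join the batch only if enough free KV cache blocks exist and the batch-size cap allows it.

```python
# Toy in-flight (continuous) batching scheduler. Assumes blocks for already-
# active requests were allocated earlier; only new admissions consume blocks.

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value)

def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division

def schedule(active, waiting, free_blocks, max_batch_size):
    """active/waiting: lists of (request_id, num_tokens).
    Returns the batch for this step and the blocks still free afterwards."""
    batch = list(active)  # in-flight requests continue running
    for req_id, num_tokens in waiting:
        need = blocks_needed(num_tokens)
        if len(batch) < max_batch_size and need <= free_blocks:
            batch.append((req_id, num_tokens))
            free_blocks -= need
        # otherwise the request stays queued until blocks free up
    return batch, free_blocks

batch, remaining = schedule(
    active=[("req_a", 8)],
    waiting=[("req_b", 6), ("req_c", 20)],
    free_blocks=4,
    max_batch_size=4,
)
```

Here req_b (2 blocks) is admitted alongside the in-flight req_a, while req_c (5 blocks) waits for capacity; this per-step admission is what lets finished requests make room for queued ones without draining the whole batch.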

Decoding Pipeline

  • Decoder: Orchestrates token generation
  • Sampler: Applies sampling strategies (greedy, top-k, top-p, beam search)
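A plain-Python sketch of those sampling strategies over a logits list (the real Sampler runs on-GPU; beam search is omitted for brevity):

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    # Pick the single highest-logit token.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k(logits, k, rng):
    # Keep the k highest logits, renormalize, then sample.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]

def top_p(logits, p, rng):
    # Nucleus sampling: smallest token set whose probability mass reaches p.
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

rng = random.Random(0)
pick = top_k([0.1, 2.0, -1.0], k=2, rng=rng)
```

Greedy is deterministic; top-k and top-p trade determinism for diversity by sampling from a truncated, renormalized distribution.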

Request Flow

Understanding how a request flows through the system helps clarify the role of each component:
┌─────────────────────────────────────────────────────────┐
│ 1. User submits prompt to LLM API                       │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 2. Tokenization (text → token IDs)                      │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 3. Executor receives request                            │
│    (PyExecutor / TensorRT Executor / ADExecutor)        │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 4. Scheduler decides when to process request            │
│    - Checks available KV cache blocks                   │
│    - Applies max_batch_size and max_num_tokens limits   │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 5. Model Forward Pass                                   │
│    - Context phase: Process all prompt tokens           │
│    - Generation phase: Process one token per step       │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 6. Decoder generates next token                         │
│    - Receives logits from model                         │
│    - Sampler applies sampling strategy                  │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 7. Update state and check completion                    │
│    - Add token to KV cache                              │
│    - Check stop criteria (EOS, max_length)              │
│    - Return to step 4 if not finished                   │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 8. Detokenization (token IDs → text)                    │
└────────────────────────────┬────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────┐
│ 9. Return result to user                                │
└─────────────────────────────────────────────────────────┘

PyExecutor Iteration Loop

The PyExecutor operates in a continuous loop, processing batches of requests:
# Simplified PyExecutor iteration (from py_executor.py)
while not done:
    # 1. Fetch new requests from queue and add them to the active set
    new_requests = request_queue.get_new_requests()
    active_requests.extend(new_requests)
    
    # 2. Schedule requests for this step
    scheduled_batch = scheduler.schedule(
        active_requests=active_requests,
        available_kv_blocks=kv_cache_manager.available_blocks()
    )
    
    # 3. Prepare KV cache resources
    kv_cache_manager.prepare_resources(scheduled_batch)
    
    # 4. Run model forward pass
    logits = model_engine.forward(scheduled_batch)
    
    # 5. Sample next tokens
    next_tokens = sampler.sample(logits, scheduled_batch)
    
    # 6. Update requests and check completion
    for request in scheduled_batch:
        request.add_token(next_tokens[request.id])
        if request.is_finished():
            return_to_user(request)
            active_requests.remove(request)
            kv_cache_manager.free_resources(request)
The Overlap Scheduler optimization allows CPU tasks (like checking stop criteria) to run concurrently with GPU computation, maximizing throughput. See Optimization Techniques for details.
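The overlap pattern itself can be sketched with ordinary threads standing in for CUDA streams (an illustration of the idea, not the real implementation): the next step's "GPU" work is launched before the CPU post-processes the previous step's tokens.

```python
# Toy overlap sketch: a worker thread plays the role of the GPU; while it runs
# step N+1, the main thread does CPU-side work on step N's output.
from concurrent.futures import ThreadPoolExecutor
import time

def gpu_step(step):
    # Stand-in for a model forward pass on the GPU.
    time.sleep(0.01)
    return f"tokens_{step}"

def cpu_postprocess(tokens):
    # Stand-in for stop-criteria checks, detokenization, bookkeeping.
    return tokens.upper()

results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(gpu_step, 0)
    for step in range(1, 4):
        tokens = future.result()              # wait for the previous step
        future = pool.submit(gpu_step, step)  # launch next step immediately
        results.append(cpu_postprocess(tokens))  # CPU work overlaps GPU work
    results.append(cpu_postprocess(future.result()))
```

The key move is submitting the next step before consuming the previous step's result, so CPU and "GPU" time are paid concurrently rather than back to back.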

Key Configuration Files

Understanding the main source files helps navigate the codebase:
File                                                    Purpose
tensorrt_llm/llmapi/llm.py                              Main LLM API entry point
tensorrt_llm/llmapi/llm_args.py                         Complete configuration schema (Pydantic-based)
tensorrt_llm/llmapi/llm_utils.py                        Model loading and model-specific defaults
tensorrt_llm/_torch/pyexecutor/py_executor.py           PyExecutor implementation
tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py   Request scheduler
tensorrt_llm/_torch/pyexecutor/resource_manager.py      KV cache and resource management
tensorrt_llm/_torch/pyexecutor/model_engine.py          Model forward pass execution
tensorrt_llm/_torch/pyexecutor/sampler.py               Token sampling logic

Model Architecture Pattern

All models in TensorRT-LLM follow a consistent pattern:
1. Config Class

   Each model has a Config class that inherits from PretrainedConfig.
   Example: LlamaConfig in tensorrt_llm/models/llama/config.py

2. ForCausalLM Class

   Each model implements a ForCausalLM class (e.g., LlamaForCausalLM) that inherits from PretrainedModel. This class contains the actual model implementation.

3. Auto-Registration

   Models self-register via automodel.py for automatic discovery. The system uses the HuggingFace config’s architectures field to find the right model.
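The registration pattern can be sketched as follows (an illustrative toy, not the real automodel.py; the registry and helper names here are made up): classes register under their HuggingFace architecture name, and a factory resolves the config's architectures field against the registry.

```python
# Toy model registry keyed by the HF `architectures` name.
MODEL_REGISTRY: dict[str, type] = {}

def register(architecture: str):
    """Class decorator: record the class under its architecture name."""
    def wrap(cls):
        MODEL_REGISTRY[architecture] = cls
        return cls
    return wrap

@register("LlamaForCausalLM")
class LlamaForCausalLM:
    def __init__(self, config):
        self.config = config

def from_hf_config(hf_config: dict):
    # HF configs carry an `architectures` list, e.g. ["LlamaForCausalLM"];
    # look it up to pick the implementation class.
    arch = hf_config["architectures"][0]
    return MODEL_REGISTRY[arch](hf_config)

model = from_hf_config({"architectures": ["LlamaForCausalLM"],
                        "hidden_size": 2048})
```

Because registration happens at import time via the decorator, adding a new model requires no changes to the factory itself.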

Distributed Execution

TensorRT-LLM supports distributed inference across multiple GPUs:
  • Tensor Parallelism: Split individual layers across GPUs
  • Pipeline Parallelism: Distribute layers across GPUs
  • Communication Backends: MPI, Ray, or RPC
  • Mapping Class: The Mapping class in tensorrt_llm/mapping.py handles the distribution strategy
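As an illustration of what such a mapping computes, here is a minimal sketch (the real Mapping class in tensorrt_llm/mapping.py supports more parallelism modes and options; the rank layout below is an assumption for illustration):

```python
# Toy Mapping: derive tensor-parallel and pipeline-parallel coordinates from a
# flat world rank, assuming TP is the fastest-varying dimension.
class Mapping:
    def __init__(self, world_size: int, rank: int, tp_size: int, pp_size: int):
        assert world_size == tp_size * pp_size
        self.tp_size, self.pp_size = tp_size, pp_size
        self.tp_rank = rank % tp_size   # position within a tensor-parallel group
        self.pp_rank = rank // tp_size  # which pipeline stage this rank runs

m = Mapping(world_size=8, rank=5, tp_size=4, pp_size=2)
```

With this layout, rank 5 of 8 lands at tp_rank 1 in the second pipeline stage: its layer shards belong to stage 1, and it all-reduces with the other three ranks of that stage.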

Serving Architecture

For production deployments, TensorRT-LLM provides trtllm-serve:
  • OpenAI-compatible REST + gRPC server
  • Supports all backends (PyTorch, TensorRT, AutoDeploy)
  • Disaggregated serving: Separates prefill (context processing) and decode (generation) across different GPU pools
    • KV cache exchange via NIXL (default), UCX, or MPI
    • Optimizes resource utilization for different workload characteristics

Backend Selection

Learn when to use PyTorch, TensorRT, or AutoDeploy backends

Optimization Techniques

Explore in-flight batching, paged KV cache, and CUDA graphs
