High-Level Architecture
The system follows a three-layer architecture.

The PyTorch backend is the default and recommended backend for most use cases. It provides the best balance of performance and flexibility.
Core Components
TensorRT-LLM’s architecture is built around several key components that work together to deliver high-performance inference.

LLM API (Entry Point)
The LLM class in tensorrt_llm/llmapi/llm.py serves as the main entry point for users and is responsible for:
- Tokenization: Converting input text to token IDs
- Detokenization: Converting output token IDs back to text
- Backend selection: Choosing the appropriate execution backend
- Model loading: Loading model weights and configuration
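The entry point's responsibilities can be sketched conceptually. This is a toy stand-in, not the real LLM class: it uses a naive whitespace tokenizer, whereas the real class delegates to a HuggingFace tokenizer and to the selected executor backend.

```python
# Conceptual sketch of the LLM entry point's responsibilities.
# ToyLLM and its whitespace tokenizer are illustrative stand-ins only.

class ToyLLM:
    BACKENDS = ("pytorch", "tensorrt", "_autodeploy")

    def __init__(self, backend="pytorch"):
        # Backend selection: choose the appropriate execution backend.
        if backend not in self.BACKENDS:
            raise ValueError(f"unknown backend: {backend}")
        self.backend = backend
        self.vocab = {}       # token -> id
        self.inv_vocab = {}   # id -> token

    def tokenize(self, text):
        # Tokenization: convert input text to token IDs.
        ids = []
        for tok in text.split():
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                self.inv_vocab[self.vocab[tok]] = tok
            ids.append(self.vocab[tok])
        return ids

    def detokenize(self, ids):
        # Detokenization: convert output token IDs back to text.
        return " ".join(self.inv_vocab[i] for i in ids)

llm = ToyLLM(backend="pytorch")
ids = llm.tokenize("hello world hello")
print(ids)                   # [0, 1, 0]
print(llm.detokenize(ids))   # hello world hello
```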
Executor Layer
The executor layer manages the execution of inference requests. Different backends use different executors.

PyExecutor (PyTorch Backend)
The PyExecutor creates a dedicated worker process on each GPU rank and operates in a continuous background loop to process inference requests asynchronously.

Location: tensorrt_llm/_torch/pyexecutor/py_executor.py

Key responsibilities:
- Fetches new inference requests from the request queue
- Schedules requests for execution
- Manages model forward passes
- Coordinates with the decoder for token generation
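The responsibilities above can be sketched as a toy event loop. This is a conceptual illustration in pure Python; the real loop runs per GPU rank and calls into the C++ scheduler and decoder.

```python
# Toy sketch of the PyExecutor background loop: fetch requests from a queue,
# schedule a batch, run a (mock) forward pass, and retire finished requests.
from queue import Queue, Empty

def executor_loop(request_queue, max_batch_size=2):
    active, finished = [], []
    while True:
        # Fetch new inference requests without blocking.
        try:
            while len(active) < max_batch_size:
                active.append(request_queue.get_nowait())
        except Empty:
            pass
        if not active:
            break
        # "Forward pass": append one mock token per scheduled request.
        for req in active:
            req["tokens"].append(len(req["tokens"]))
        # Decoder/stop-criteria check: retire requests at max_new_tokens.
        still_active = []
        for req in active:
            done = len(req["tokens"]) >= req["max_new_tokens"]
            (finished if done else still_active).append(req)
        active = still_active
    return finished

q = Queue()
q.put({"id": 0, "tokens": [], "max_new_tokens": 2})
q.put({"id": 1, "tokens": [], "max_new_tokens": 3})
done = executor_loop(q)
print([(r["id"], r["tokens"]) for r in done])   # [(0, [0, 1]), (1, [0, 1, 2])]
```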
TensorRT Executor (Legacy)
The TensorRT backend uses compiled TensorRT engines for maximum performance. It is the legacy backend, maintained for backward compatibility.

Entry point: LLM(backend="tensorrt")

Path: builder.py → trtllm.Executor → TensorRT Engine
ADExecutor (AutoDeploy - Beta)
AutoDeploy automatically converts PyTorch/HuggingFace models into optimized TensorRT-LLM inference graphs through automated graph transformations.

Entry point: LLM(backend="_autodeploy")

Path: _torch/auto_deploy/ → ADExecutor → graph transforms + torch.export

Shared C++ Core
All backends share highly optimized C++ components (exposed via Nanobind bindings) for critical runtime operations.

Scheduling Pipeline
- Scheduler: Determines which requests can be executed
- Batch Manager: Implements in-flight batching (continuous batching)
- KV Cache Manager: Allocates and manages key-value cache blocks
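The KV Cache Manager's block-allocation role can be illustrated with a minimal paged-allocation sketch. This toy class is illustrative only; the real manager is a C++ component with far more logic (block reuse, eviction, sharing).

```python
# Minimal sketch of paged KV cache allocation: a fixed pool of fixed-size
# blocks handed out to sequences and returned when they finish.
class ToyKVCacheManager:
    def __init__(self, num_blocks, tokens_per_block):
        self.free_blocks = list(range(num_blocks))
        self.tokens_per_block = tokens_per_block
        self.allocated = {}  # seq_id -> list of block ids

    def allocate(self, seq_id, num_tokens):
        needed = -(-num_tokens // self.tokens_per_block)  # ceiling division
        if needed > len(self.free_blocks):
            return False  # scheduler must defer this request
        self.allocated[seq_id] = [self.free_blocks.pop() for _ in range(needed)]
        return True

    def free(self, seq_id):
        self.free_blocks.extend(self.allocated.pop(seq_id))

mgr = ToyKVCacheManager(num_blocks=4, tokens_per_block=16)
assert mgr.allocate("req-a", 40)       # needs 3 blocks
assert not mgr.allocate("req-b", 33)   # needs 3, only 1 free -> deferred
mgr.free("req-a")
assert mgr.allocate("req-b", 33)       # now fits
```

Deferring a request when blocks run out is the same decision the scheduler makes when admitting requests into the in-flight batch.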
Decoding Pipeline
- Decoder: Orchestrates token generation
- Sampler: Applies sampling strategies (greedy, top-k, top-p, beam search)
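The listed sampling strategies (minus beam search) can be sketched over a toy logit vector in pure Python; the real Sampler operates on GPU tensors.

```python
# Sketch of greedy, top-k, and top-p sampling over a small logit vector.
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, strategy="greedy", k=2, p=0.9, rng=random.Random(0)):
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if strategy == "greedy":
        return order[0]
    if strategy == "top_k":
        candidates = order[:k]
    elif strategy == "top_p":
        # Smallest prefix of tokens whose cumulative probability >= p.
        candidates, total = [], 0.0
        for i in order:
            candidates.append(i)
            total += probs[i]
            if total >= p:
                break
    # Renormalize over the candidate set and draw.
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights)[0]

logits = [1.0, 3.0, 0.5, 2.0]
print(sample(logits, "greedy"))   # 1 (index of the highest logit)
```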
Request Flow
Understanding how a request flows through the system helps clarify the role of each component.

PyExecutor Iteration Loop
The PyExecutor operates in a continuous loop, processing batches of requests.

The Overlap Scheduler optimization allows CPU tasks (like checking stop criteria) to run concurrently with GPU computation, maximizing throughput. See Optimization Techniques for details.
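The overlap idea can be sketched with a worker thread: while the "GPU" computes step N, the CPU post-processes step N-1. The sleep and stop check below are stand-ins, not the real implementation.

```python
# Sketch of overlap scheduling: CPU post-processing of the previous step
# runs on a worker thread while the main loop issues the next "GPU" step.
from concurrent.futures import ThreadPoolExecutor
import time

def gpu_step(n):
    time.sleep(0.01)   # stand-in for a model forward pass
    return n

def cpu_postprocess(result, stopped):
    if result >= 2:    # stand-in for a stop-criteria check
        stopped.append(result)

stopped = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = None
    for step in range(4):
        result = gpu_step(step)
        # Wait for the previous step's CPU work, then hand off this step's
        # post-processing so it overlaps with the next GPU step.
        if pending is not None:
            pending.result()
        pending = pool.submit(cpu_postprocess, result, stopped)
    pending.result()
print(stopped)   # [2, 3]
```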
Key Configuration Files
Understanding the main source files helps navigate the codebase:

| File | Purpose |
|---|---|
| tensorrt_llm/llmapi/llm.py | Main LLM API entry point |
| tensorrt_llm/llmapi/llm_args.py | Complete configuration schema (Pydantic-based) |
| tensorrt_llm/llmapi/llm_utils.py | Model loading and model-specific defaults |
| tensorrt_llm/_torch/pyexecutor/py_executor.py | PyExecutor implementation |
| tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py | Request scheduler |
| tensorrt_llm/_torch/pyexecutor/resource_manager.py | KV cache and resource management |
| tensorrt_llm/_torch/pyexecutor/model_engine.py | Model forward pass execution |
| tensorrt_llm/_torch/pyexecutor/sampler.py | Token sampling logic |
Model Architecture Pattern
All models in TensorRT-LLM follow a consistent pattern.

Config Class
Each model has a Config class that inherits from PretrainedConfig. Example: LlamaConfig in tensorrt_llm/models/llama/config.py.

ForCausalLM Class
Each model implements a ForCausalLM class (e.g., LlamaForCausalLM) that inherits from PretrainedModel. This class contains the actual model implementation.
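The Config / ForCausalLM pattern can be illustrated with toy stand-in base classes; the real PretrainedConfig and PretrainedModel live in the TensorRT-LLM codebase and carry much more machinery.

```python
# Toy illustration of the Config / ForCausalLM pattern described above.
# PretrainedConfig/PretrainedModel here are simplified stand-ins.
class PretrainedConfig:
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

class PretrainedModel:
    def __init__(self, config):
        self.config = config

class LlamaConfig(PretrainedConfig):
    def __init__(self, hidden_size=4096, num_layers=32, **kwargs):
        super().__init__(hidden_size=hidden_size, num_layers=num_layers, **kwargs)

class LlamaForCausalLM(PretrainedModel):
    def forward(self, token_ids):
        # The real implementation runs embedding -> decoder layers -> lm_head;
        # here we just echo config-derived shape information.
        return {"tokens": len(token_ids), "layers": self.config.num_layers}

model = LlamaForCausalLM(LlamaConfig(num_layers=2))
print(model.forward([1, 2, 3]))   # {'tokens': 3, 'layers': 2}
```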
Distributed Execution

TensorRT-LLM supports distributed inference across multiple GPUs:
- Tensor Parallelism: Split individual layers across GPUs
- Pipeline Parallelism: Distribute layers across GPUs
- Communication Backends: MPI, Ray, or RPC
- Mapping Class: The Mapping class in tensorrt_llm/mapping.py handles the distribution strategy
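A Mapping-style class essentially derives per-rank parallel coordinates from a world rank. The sketch below assumes one common grouping convention (consecutive ranks share a pipeline stage) and is not the real Mapping class, which covers more dimensions.

```python
# Sketch of deriving tensor- and pipeline-parallel coordinates from a
# world rank, assuming consecutive ranks form one tensor-parallel group.
class ToyMapping:
    def __init__(self, world_size, tp_size, pp_size):
        assert world_size == tp_size * pp_size
        self.tp_size, self.pp_size = tp_size, pp_size

    def coords(self, rank):
        return {"pp_rank": rank // self.tp_size, "tp_rank": rank % self.tp_size}

m = ToyMapping(world_size=8, tp_size=4, pp_size=2)
print(m.coords(0))   # {'pp_rank': 0, 'tp_rank': 0}
print(m.coords(5))   # {'pp_rank': 1, 'tp_rank': 1}
```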
Serving Architecture
For production deployments, TensorRT-LLM provides trtllm-serve:
- OpenAI-compatible REST + gRPC server
- Supports all backends (PyTorch, TensorRT, AutoDeploy)
- Disaggregated serving: Separates prefill (context processing) and decode (generation) across different GPU pools
- KV cache exchange via NIXL (default), UCX, or MPI
- Optimizes resource utilization for different workload characteristics
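Because the server is OpenAI-compatible, any OpenAI-style client can talk to it. The snippet below builds a standard chat-completions request body; the model name and URL are placeholders, not values from this document.

```python
# Building an OpenAI-compatible chat-completions request body.
# "my-model" and the localhost URL are placeholder assumptions.
import json

payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
body = json.dumps(payload)
# e.g. POST this body to http://localhost:8000/v1/chat/completions
print(json.loads(body)["messages"][0]["content"])   # Hello
```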
Backend Selection
Learn when to use the PyTorch, TensorRT, or AutoDeploy backends
Optimization Techniques
Explore in-flight batching, paged KV cache, and CUDA graphs