Backend Comparison
PyTorch
Default - Recommended for most users. Best balance of performance and flexibility.
TensorRT
Legacy - Maintained for compatibility. Uses compiled TensorRT engines.
AutoDeploy
Beta - Experimental. Automatic model optimization.
Backend Overview Table
| Feature | PyTorch | TensorRT | AutoDeploy |
|---|---|---|---|
| Status | Default ✅ | Legacy | Beta (experimental) |
| Entry Point | `LLM(backend="pytorch")` | `LLM(backend="tensorrt")` | `LLM(backend="_autodeploy")` |
| Key Path | `_torch/pyexecutor/` → PyExecutor | `builder.py` → trtllm.Executor | `_torch/auto_deploy/` → ADExecutor |
| Performance | Excellent | Maximum | Good (improving) |
| Flexibility | High | Low | Very High |
| Build Time | None (dynamic) | Long (compilation) | Medium (graph transforms) |
| New Model Support | Requires implementation | Requires implementation | Day-0 support |
| Recommended For | Production, development | Legacy workloads | Prototyping, new models |
PyTorch Backend (Default)
The PyTorch backend is the default and recommended backend for TensorRT-LLM. It combines excellent performance with maximum flexibility.
Architecture
Key Features
Dynamic Execution
- No compilation step required
- Immediate model loading and inference
- Easy debugging with standard PyTorch tools
- Supports `torch.compile` for additional optimization
Custom Attention Kernels
The PyTorch backend uses highly optimized custom attention implementations:
- TrtllmAttention (default): Hand-tuned CUDA kernels for maximum performance
- FlashInferAttention: Alternative backend with FP8 quantization support
- VanillaAttention: Reference implementation for testing
Select the attention backend with `LLM(attn_backend="trtllm")` or `LLM(attn_backend="flashinfer")`.
Full Feature Support
- In-flight batching (continuous batching)
- Paged KV cache with cross-request reuse
- Speculative decoding (EAGLE, Medusa, n-gram, etc.)
- LoRA adapters with dynamic switching
- Multi-modal models (vision-language)
- Quantization (FP8, INT8, INT4)
- CUDA Graphs
- Overlap scheduler
Distributed Inference
- Tensor parallelism
- Pipeline parallelism
- Multiple communication backends (MPI, Ray, RPC)
- Disaggregated serving (separate prefill and decode)
When to Use PyTorch Backend
Use the PyTorch backend when:
- Starting a new project (it’s the default)
- You need rapid iteration and development
- You want the latest features and optimizations
- You need to debug model behavior
- You’re deploying to production (recommended)
Example Usage
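A minimal sketch of the PyTorch backend through the LLM API. The model name is a placeholder; since PyTorch is the default backend, the `backend` argument can be omitted. Exact parameter names may vary across releases, and running this requires a GPU with TensorRT-LLM installed.

```python
from tensorrt_llm import LLM, SamplingParams

# PyTorch is the default backend, so backend="pytorch" is optional.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# No compilation step: the model loads and serves immediately.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```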
Source Location: All PyTorch backend code is in `tensorrt_llm/_torch/`. Key files:
- `_torch/pyexecutor/py_executor.py` - Main executor
- `_torch/pyexecutor/model_engine.py` - Model execution
- `_torch/attention_backend/` - Attention implementations
TensorRT Backend (Legacy)
The TensorRT backend uses compiled TensorRT engines for inference. This backend is considered legacy and is maintained primarily for backward compatibility.
Architecture
Key Characteristics
Advantages:
- Maximum theoretical performance through aggressive optimization
- Highly optimized kernel fusion
- Efficient memory usage

Disadvantages:
- Long build times (30+ minutes for large models)
- Hardware-specific engines (cannot transfer between GPU types)
- Limited flexibility (cannot modify model after compilation)
- Slower to adopt new features
- Difficult to debug
When to Use TensorRT Backend
Use the TensorRT backend only when:
- You have an existing deployment using TensorRT engines
- You need to maintain backward compatibility
- You have very specific performance requirements that PyTorch backend doesn’t meet
Example Usage
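Per the overview table, the legacy backend is selected via the `backend` argument. A hedged sketch (model name is a placeholder; requires a GPU and a TensorRT-LLM installation):

```python
from tensorrt_llm import LLM

# Selecting the legacy TensorRT backend triggers engine compilation,
# which can take a long time for large models. The resulting engine
# is specific to the GPU it was built on.
llm = LLM(model="meta-llama/Llama-2-7b-hf", backend="tensorrt")

for output in llm.generate(["The three primary colors are"]):
    print(output.outputs[0].text)
```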
AutoDeploy Backend (Beta)
AutoDeploy is an experimental backend that automatically optimizes PyTorch/HuggingFace models for inference through automated graph transformations. It requires no manual model implementation.
Status: Beta - Under active development. The API may change in future releases.
Architecture
Key Features
Zero Code Changes
Works with unmodified PyTorch/HuggingFace models. No manual kernel implementation required.
Day-0 Model Support
Supports new model architectures immediately. Great for prototyping and experimentation.
Automated Optimization
Automatic graph transformations:
- Sharding for multi-GPU
- KV cache integration
- Attention fusion
- Quantization
- CUDA Graph optimization
Single Source of Truth
Maintain your original PyTorch model. No need for separate inference implementations.
Workflow
Graph Transformation
Applies automated transformations:
- Graph sharding for tensor parallelism
- KV cache block insertion
- GEMM fusion
- Custom attention operator replacement
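To illustrate the idea behind these transformations (this is a toy sketch, not AutoDeploy's actual implementation), consider a pattern rewriter that replaces a matmul → softmax → matmul sequence with a single fused attention op over a flat list of op names:

```python
# Toy illustration of pattern-based graph rewriting, NOT AutoDeploy's
# real pass: the "graph" is a flat list of op names, and each match of
# a known pattern is replaced by one fused op.
def rewrite(ops, pattern, fused):
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + len(pattern)] == pattern:
            out.append(fused)        # replace the matched window
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out

graph = ["embed", "matmul", "softmax", "matmul", "add", "layernorm"]
fused = rewrite(graph, ["matmul", "softmax", "matmul"], "fused_attention")
print(fused)  # ['embed', 'fused_attention', 'add', 'layernorm']
```

Real graph transforms operate on a traced representation (e.g. a torch.fx graph) rather than a flat list, but the match-and-replace structure is the same.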
When to Use AutoDeploy Backend
Use the AutoDeploy backend when:
- You’re working with a new model architecture not yet supported in TensorRT-LLM
- You need rapid prototyping and experimentation
- You want to deploy a custom PyTorch model without manual optimization
- You’re evaluating whether to invest in a full TensorRT-LLM implementation
Example Usage
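Following the entry point in the overview table (`backend="_autodeploy"`), a hedged sketch (model name is a placeholder; requires a GPU and a TensorRT-LLM installation):

```python
from tensorrt_llm import LLM

# AutoDeploy takes an unmodified HuggingFace model and applies
# automated graph transformations; no manual implementation needed.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", backend="_autodeploy")

for output in llm.generate(["Explain KV caching in one sentence."]):
    print(output.outputs[0].text)
```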
Example: Custom Model
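Because AutoDeploy works with unmodified PyTorch/HuggingFace models, a custom model should only need a HuggingFace-format checkpoint on disk. A hedged sketch; the path is a placeholder:

```python
from tensorrt_llm import LLM

# Point AutoDeploy at a local HuggingFace-format checkpoint directory.
# "/path/to/my-custom-model" is a placeholder, not a real path.
llm = LLM(model="/path/to/my-custom-model", backend="_autodeploy")

for output in llm.generate(["Hello"]):
    print(output.outputs[0].text)
```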
Source Location: AutoDeploy code is in `tensorrt_llm/_torch/auto_deploy/`.
Roadmap:
- Vision-Language Models (VLMs)
- State Space Models (SSMs)
- LoRA support
- Speculative decoding
Choosing the Right Backend
Decision Flow
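The decision flow can be sketched as a small helper. This is illustrative only; the rules simply mirror the "When to Use" sections above:

```python
def choose_backend(new_architecture=False, legacy_trt_deployment=False):
    """Mirror of the decision flow described above (illustrative only)."""
    if legacy_trt_deployment:
        return "tensorrt"     # maintain existing compiled-engine workloads
    if new_architecture:
        return "_autodeploy"  # day-0 support for unsupported models
    return "pytorch"          # default: production and development

print(choose_backend())                       # pytorch
print(choose_backend(new_architecture=True))  # _autodeploy
```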
Performance Comparison
In most cases, the PyTorch backend provides performance within 5-10% of the TensorRT backend, without any compilation overhead. For many workloads, especially with CUDA Graphs enabled, the PyTorch backend matches or exceeds TensorRT backend performance.
Benchmark Example (Llama-2-7B on H100)
| Backend | Throughput (tokens/s) | Build Time | Flexibility |
|---|---|---|---|
| PyTorch | 12,500 | None | High |
| TensorRT | 13,000 | 45 min | Low |
| AutoDeploy | 10,000 | 5 min | Very High |
Actual performance depends on many factors: model architecture, batch size, sequence length, hardware, and configuration parameters. Always benchmark with your specific workload.
Shared Features Across All Backends
All three backends benefit from the shared C++ core components:
- Scheduler: In-flight batching and request scheduling
- KV Cache Manager: Paged memory management with cross-request reuse
- Batch Manager: Dynamic batching optimization
- Decoder: Token generation orchestration
- Sampler: Sampling strategies (greedy, top-k, top-p, beam search)
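For example, the shared paged KV cache manager can be tuned from the LLM API regardless of backend. A sketch assuming the `llmapi` `KvCacheConfig` class; field names may differ across releases:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Tune the shared paged KV cache manager; enable_block_reuse turns on
# cross-request prefix reuse. Field names may differ across releases.
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.85,
    enable_block_reuse=True,
)
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    kv_cache_config=kv_cache_config,
)
```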
System Architecture
Learn about the overall system design
Optimization Techniques
Explore advanced performance optimizations