TensorRT Execution Provider
The TensorRT Execution Provider delivers maximum inference performance on NVIDIA GPUs by leveraging NVIDIA TensorRT, a high-performance deep learning inference optimizer and runtime.
When to Use TensorRT EP
Use the TensorRT Execution Provider when:
- You need maximum performance on NVIDIA GPUs
- Your model is finalized and ready for production
- You can tolerate longer initial load times for faster inference
- You want to use FP16 or INT8 precision for better performance
- Your deployment uses fixed or limited input shapes
Key Features
- Advanced Optimizations: Layer fusion, kernel auto-tuning, precision calibration
- Mixed Precision: FP32, FP16, INT8, BF16 support
- Dynamic Shapes: Handle variable input shapes with optimization profiles
- Engine Caching: Save optimized engines to disk for faster startup
- DLA Support: Offload to Deep Learning Accelerator (Jetson, Drive platforms)
Prerequisites
Hardware Requirements
- NVIDIA GPU with compute capability 6.0 or higher
- Recommended: 6GB+ GPU memory
Software Requirements
- TensorRT: 8.6.x or 10.x
- CUDA Toolkit: 11.8 or 12.x
- cuDNN: 8.x or 9.x
- ONNX Runtime TensorRT package
Installation
Python
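For Python, the official GPU build of ONNX Runtime ships with the TensorRT EP included; it can typically be installed with pip (a matching CUDA and TensorRT installation must already be present on the system):

```shell
# GPU build of ONNX Runtime; includes the TensorRT Execution Provider.
# Requires compatible CUDA Toolkit and TensorRT libraries on the host.
pip install onnxruntime-gpu
```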
Docker (Recommended)
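One convenient route is to start from NVIDIA's TensorRT container, which bundles a matching CUDA/TensorRT stack, and install the GPU build inside it. The image tag below is illustrative; pick one whose TensorRT version matches your ONNX Runtime release:

```shell
# Illustrative: run an NGC TensorRT base image with GPU access
docker run --gpus all -it nvcr.io/nvidia/tensorrt:24.05-py3
# then, inside the container:
pip install onnxruntime-gpu
```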
C++
Download the TensorRT-enabled build from ONNX Runtime releases.
Basic Usage
Python
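A minimal sketch, assuming the onnxruntime-gpu build with TensorRT support is installed. The provider names are the standard ONNX Runtime identifiers; listing CUDA and CPU after TensorRT gives graceful fallback for unsupported operators:

```python
# Provider priority: TensorRT first, then CUDA, then CPU as last resort.
providers = [
    ("TensorrtExecutionProvider", {"device_id": 0}),
    ("CUDAExecutionProvider", {}),
    ("CPUExecutionProvider", {}),
]

def create_session(model_path):
    """Create an InferenceSession that prefers the TensorRT EP."""
    # Import deferred so the sketch parses even without the package installed.
    import onnxruntime as ort
    return ort.InferenceSession(model_path, providers=providers)
```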
C++
C#
Configuration Options
Python Provider Options
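A sketch of commonly used TensorRT EP options, passed as the second element of the provider tuple. Option names follow the ONNX Runtime TensorRT EP documentation; the values shown are illustrative defaults, not recommendations:

```python
trt_options = {
    "device_id": 0,                                    # which GPU to use
    "trt_max_workspace_size": 2 * 1024 * 1024 * 1024,  # 2 GB scratch space
    "trt_fp16_enable": True,                           # allow FP16 kernels
    "trt_engine_cache_enable": True,                   # persist built engines
    "trt_engine_cache_path": "./trt_cache",            # illustrative path
}
providers = [("TensorrtExecutionProvider", trt_options)]
```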
Key Configuration Parameters
Precision Modes
FP16 (Half Precision)
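Enabling FP16 is a single option; a sketch (GPUs with fast FP16 paths, e.g. Volta or newer, benefit most):

```python
fp16_options = {
    # Let TensorRT choose FP16 kernels where it judges them numerically safe.
    "trt_fp16_enable": True,
}
providers = [("TensorrtExecutionProvider", fp16_options)]
```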
Best balance of speed and accuracy.
INT8 (8-bit Integer)
Maximum performance, with calibration required.
BF16 (Brain Float16)
For NVIDIA Ampere and newer GPUs.
Engine Caching
Save optimized engines to avoid rebuilding:
- Dramatically faster session creation (seconds vs. minutes)
- Consistent performance across runs
- Strongly recommended for production deployments
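A sketch of enabling the on-disk engine cache (the cache directory is illustrative; engines built on the first run are reloaded on later runs):

```python
import os

cache_dir = "./trt_engine_cache"
os.makedirs(cache_dir, exist_ok=True)

cache_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": cache_dir,  # engines are written here and reused
}
```

Note that cached engines are tied to the GPU model, TensorRT version, and model; a change to any of these triggers a rebuild.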
Dynamic Shapes
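Optimization profiles are supplied as min/opt/max shape strings per input. The input name `input` and the dimensions below are illustrative; they must match your model's actual input:

```python
profile_options = {
    # Format: "<input_name>:<dims>", with dims separated by 'x'.
    "trt_profile_min_shapes": "input:1x3x224x224",
    "trt_profile_opt_shapes": "input:8x3x224x224",   # shape kernels are tuned for
    "trt_profile_max_shapes": "input:32x3x224x224",
}
```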
Optimize for variable input sizes.
Builder Optimization Level
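In recent releases the builder optimization level ranges from 0 to 5 (higher means longer engine builds but potentially faster engines); a sketch:

```python
build_options = {
    # 0 = fastest build, least kernel tuning; 5 = slowest build, most tuning.
    "trt_builder_optimization_level": 3,
}
```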
Control the trade-off between build time and runtime performance.
Performance Optimization
INT8 Calibration
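A sketch of pointing the EP at a precomputed calibration table; the table filename is illustrative, and generating the table (by running representative inputs through a calibrator) is a separate step:

```python
int8_options = {
    "trt_int8_enable": True,
    # Name of a calibration table produced ahead of time (illustrative name).
    "trt_int8_calibration_table_name": "calibration.flatbuffers",
    # Set True if the table was produced by TensorRT's native calibrator.
    "trt_int8_use_native_calibration_table": False,
}
```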
For INT8 quantization, you need a calibration cache.
Timing Cache
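The timing cache stores kernel auto-tuning results so later engine builds skip re-benchmarking; a sketch (path illustrative):

```python
timing_options = {
    "trt_timing_cache_enable": True,
    "trt_timing_cache_path": "./trt_timing_cache",  # reused across engine builds
}
```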
Speed up engine building across sessions.
Context Memory Sharing
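A sketch of sharing execution-context scratch memory when a session contains multiple TensorRT engines:

```python
shared_options = {
    # Share context scratch memory across engines within one session,
    # trading some concurrency for lower peak GPU memory use.
    "trt_context_memory_sharing_enable": True,
}
```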
Reduce memory usage when running multiple engines.
Auxiliary Streams
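A sketch of capping the number of auxiliary CUDA streams TensorRT may use per inference stream (in the documented option, -1 leaves the choice to heuristics and 0 disables them):

```python
stream_options = {
    # 2 here is illustrative; more streams can raise parallelism and memory use.
    "trt_auxiliary_streams": 2,
}
```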
Control multi-stream parallelism.
Production Deployment
Engine Serialization
Save and load optimized engines.
EP Context Model
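A sketch of dumping an EP context model, i.e. an ONNX file that carries the precompiled TensorRT engine so deployment targets skip the build step entirely (output path illustrative):

```python
epctx_options = {
    "trt_dump_ep_context_model": True,               # write engine-embedding model
    "trt_ep_context_file_path": "./model_ctx.onnx",  # illustrative output path
}
```

Loading the dumped model on a machine with the same GPU and TensorRT version creates the session without rebuilding the engine.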
Embed the TensorRT engine in an ONNX model.
Platform Support
| Platform | Support | Notes |
|---|---|---|
| Linux x64 | ✅ Full | Best support |
| Windows x64 | ✅ Full | Full features |
| Linux ARM64 | ✅ Full | Jetson, AWS Graviton |
| Windows ARM64 | ❌ No | Not supported |
| macOS | ❌ No | NVIDIA GPU required |
Supported Hardware
Data Center
- H100 (Hopper) - Best performance
- A100, A40, A30, A10 (Ampere)
- V100 (Volta)
- T4 (Turing)
Desktop
- RTX 40 Series (Ada Lovelace)
- RTX 30 Series (Ampere)
- RTX 20 Series (Turing)
- GTX 16 Series (Turing)
Edge/Embedded
- Jetson AGX Orin (with DLA)
- Jetson Orin Nano/NX
- Jetson Xavier AGX/NX (with DLA)
- NVIDIA Drive (with DLA)
Troubleshooting
Engine Build Failures
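When an engine build fails, verbose logging usually reveals the failing layer. A sketch; the detailed-build-log option exists in recent ONNX Runtime releases:

```python
debug_options = {
    "trt_detailed_build_log": True,  # emit per-layer build information
}
# Global ORT logging can also be made verbose before creating the session:
# import onnxruntime as ort
# ort.set_default_logger_severity(0)  # 0 = VERBOSE
```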
Unsupported Operators
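Fallback is handled by provider priority: subgraphs TensorRT cannot take are assigned to the next provider in the list. A sketch; after creating a session, `get_providers()` shows which providers were actually registered:

```python
providers = [
    ("TensorrtExecutionProvider", {}),
    ("CUDAExecutionProvider", {}),   # runs subgraphs TensorRT rejects
    ("CPUExecutionProvider", {}),
]
# session = onnxruntime.InferenceSession("model.onnx", providers=providers)
# print(session.get_providers())  # inspect the registered providers
```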
Some operators are not supported by TensorRT and fall back to the CUDA EP.
Precision Issues
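A reasonable debugging order is to isolate the precision mode: disable INT8, then FP16, re-checking accuracy at each step. A sketch; the layer-norm fallback flag shown in the comment exists in recent releases as one example of a targeted mitigation:

```python
safe_options = {
    "trt_fp16_enable": False,  # step 1: return to FP32 and re-test accuracy
    "trt_int8_enable": False,  # step 2: rule out calibration problems
    # Targeted mitigation: keep LayerNorm in FP32 while the rest uses FP16.
    # "trt_layer_norm_fp32_fallback": True,
}
```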
If FP16/INT8 causes accuracy problems, fall back to higher precision.
Performance Comparison
Typical speedup over CPU (varies by model):
| Precision | Speedup | Accuracy Impact |
|---|---|---|
| FP32 | 5-10x | None |
| FP16 | 10-20x | Minimal (less than 0.5%) |
| INT8 | 20-40x | Small (1-3%) with calibration |
Next Steps
- Compare with CUDA Execution Provider
- Learn about model optimization
- Explore performance tuning