Overview
AutoDeploy is a beta backend designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models from Hugging Face Transformers, to TensorRT-LLM. It provides an alternative deployment method using the LLM API without requiring code changes to the source model or manual implementation of inference optimizations.

How It Works
AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. Instead of manually implementing optimizations like KV-caches, multi-GPU parallelism, or quantization, AutoDeploy generates an inference-optimized graph that executes directly in the TensorRT-LLM PyTorch runtime.

Architecture
The AutoDeploy workflow follows this pipeline:
- PyTorch Model: Start with a standard PyTorch or Hugging Face model
- torch.export: Export using torch.export to generate a standard Torch graph containing core PyTorch ATen operations alongside custom attention operations
- Graph Transformations: Apply automated transformations including:
  - Graph sharding
  - KV-cache insertion
  - GEMM fusion
  - MHA (Multi-Head Attention) fusion
  - CudaGraph optimization
- Compilation: Compile using supported compile backends such as torch-opt
- Runtime Deployment: Deploy via the TensorRT-LLM runtime with in-flight batching, paging, and overlap scheduling
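The transformation stage above can be pictured as an ordered sequence of graph passes. The following sketch is purely illustrative (the graph representation and function names are not AutoDeploy's real API), but it mirrors the pass ordering listed above:

```python
# Conceptual sketch of AutoDeploy's transformation pipeline.
# The graph is modeled as a plain dict; each pass marks the optimization
# it would perform on a real exported graph.

def shard(graph):
    """Split weights so the graph can run across multiple GPUs."""
    return {**graph, "sharded": True}

def insert_kv_cache(graph):
    """Replace attention ops with KV-cache-aware variants."""
    return {**graph, "kv_cache": True}

def fuse_gemm(graph):
    """Merge adjacent matmuls into single fused kernels."""
    return {**graph, "gemm_fused": True}

def fuse_mha(graph):
    """Collapse attention subgraphs into fused MHA ops."""
    return {**graph, "mha_fused": True}

PASSES = [shard, insert_kv_cache, fuse_gemm, fuse_mha]

def optimize(graph):
    # Passes run in order; each consumes the previous pass's output.
    for p in PASSES:
        graph = p(graph)
    return graph

optimized = optimize({"ops": ["linear", "attention", "linear"]})
```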
Key Features
Seamless Translation
Automatically converts PyTorch/Hugging Face models to TensorRT-LLM without manual rewrites
Unified Model Definition
Maintain a single source of truth with your original PyTorch/Hugging Face model
Optimized Inference
Built-in transformations for sharding, quantization, KV-cache integration, and fusion
Day-0 Support
Deploy newly released models immediately, with performance improvements delivered continuously
Installation
AutoDeploy is included with the TensorRT-LLM installation; no separate package is required.

Quick Start
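A typical installation looks like the following; the exact package index and flags may vary by release and platform, so consult the TensorRT-LLM installation docs for your environment:

```shell
# Installing TensorRT-LLM also installs the AutoDeploy backend.
pip3 install tensorrt_llm
```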
Using the LLM API
The AutoDeploy backend can be used by specifying backend="_autodeploy" in the LLM API.
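A minimal sketch of selecting the backend through the LLM API; the model identifier is a placeholder, and constructor details may differ across TensorRT-LLM releases:

```python
from tensorrt_llm import LLM

# backend="_autodeploy" routes the model through AutoDeploy's
# export-and-transform pipeline instead of the default backend.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder Hugging Face model id
    backend="_autodeploy",
)

outputs = llm.generate(["What does AutoDeploy do?"])
```

Running this requires a GPU environment with TensorRT-LLM installed.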
Running the Demo Script
The general entry point for running AutoDeploy demos is the build_and_run_ad.py script.
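A hypothetical invocation is shown below; the model id is a placeholder and the exact flags may differ, so check the script's --help output:

```shell
# Run an AutoDeploy demo from the examples directory.
cd examples/auto_deploy
python build_and_run_ad.py --model "meta-llama/Llama-3.1-8B-Instruct"
```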
torch.export Usage
AutoDeploy uses PyTorch’s torch.export API to capture the model’s computation graph. This process:
- Generates a standard Torch graph with core PyTorch ATen operations
- Includes custom attention operations based on the attention backend specified in configuration
- Creates an exportable representation that can undergo automated transformations
Configuration
AutoDeploy can be configured through the standard LLM API configuration options. The backend automatically determines:
- Attention backend for the exported graph
- Parallelism strategies (tensor/pipeline parallelism)
- Quantization settings
- KV-cache architecture
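Although these decisions are automatic, settings such as parallelism can still be supplied explicitly through the LLM API. A hedged sketch follows; parameter availability may vary by release, and the model id is a placeholder:

```python
from tensorrt_llm import LLM

# tensor_parallel_size=2 asks the sharding transform to split the
# exported graph across two GPUs; remaining settings are derived
# automatically by the backend.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    backend="_autodeploy",
    tensor_parallel_size=2,
)
```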
Roadmap
Upcoming Model Support
- Vision-Language Models (VLMs)
- Structured State Space Models (SSMs) and Linear Attention architectures
Planned Features
- Low-Rank Adaptation (LoRA)
- Speculative Decoding for accelerated generation
Advanced Topics
For more advanced usage, see:
- Example Run Script
- Logging Configuration
- Workflow Integration
- Performance Benchmarking
- KV Cache Architecture
- Export ONNX for EdgeLLM
Contributing
We welcome community contributions! See examples/auto_deploy/CONTRIBUTING.md for guidelines.