Overview

AutoDeploy is a beta backend designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models from Hugging Face Transformers, to TensorRT-LLM. It provides an alternative deployment path through the LLM API that requires no code changes to the source model and no manual implementation of inference optimizations.
AutoDeploy is under active development: the code is subject to change and may include backward-incompatible updates.

How It Works

AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. Instead of manually implementing optimizations like KV-caches, multi-GPU parallelism, or quantization, AutoDeploy generates an inference-optimized graph that executes directly in the TensorRT-LLM PyTorch runtime.

Architecture

The AutoDeploy workflow follows this pipeline:
  1. PyTorch Model: Start with a standard PyTorch or Hugging Face model
  2. torch.export: Export using torch.export to generate a standard Torch graph containing core PyTorch ATen operations alongside custom attention operations
  3. Graph Transformations: Apply automated transformations including:
    • Graph sharding
    • KV-cache insertion
    • GEMM fusion
    • MHA (Multi-Head Attention) fusion
    • CudaGraph optimization
  4. Compilation: Compile using supported compile backends like torch-opt
  5. Runtime Deployment: Deploy via TensorRT-LLM runtime with in-flight batching, paging, and overlap scheduling
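The graph-transformation step (3) can be illustrated with a toy torch.fx pass. This is a hypothetical stand-in for AutoDeploy's real sharding and fusion passes, not its actual implementation, but it shows the node-by-node graph rewriting style such passes use:

```python
import torch
import torch.fx
import torch.nn.functional as F

class Toy(torch.nn.Module):
    """Hypothetical stand-in for a real PyTorch/HF model."""
    def forward(self, x):
        return torch.relu(x) + 1.0

# Capture the computation graph as an FX GraphModule.
gm = torch.fx.symbolic_trace(Toy())

# Toy transformation pass: rewrite every relu node to gelu.
# Real passes (sharding, fusion, KV-cache insertion) traverse and
# rewrite the graph in the same node-by-node fashion.
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.relu:
        node.target = F.gelu

gm.graph.lint()   # sanity-check the mutated graph
gm.recompile()    # regenerate the forward() from the new graph
```

After `recompile()`, calling `gm(x)` executes the transformed graph (gelu instead of relu) with no change to the original module's source.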

Key Features

Seamless Translation

Automatically converts PyTorch/Hugging Face models to TensorRT-LLM without manual rewrites

Unified Model Definition

Maintain a single source of truth with your original PyTorch/Hugging Face model

Optimized Inference

Built-in transformations for sharding, quantization, KV-cache integration, and fusion

Day-0 Support

Deploy newly released models immediately, with performance enhancements delivered continuously afterward

Installation

AutoDeploy is included with the TensorRT-LLM installation:
sudo apt-get -y install libopenmpi-dev && \
pip3 install --upgrade pip setuptools && \
pip3 install tensorrt_llm
See the installation guide for more details.

Quick Start

Using the LLM API

The AutoDeploy backend can be used by specifying backend="_autodeploy" in the LLM API:
from tensorrt_llm import LLM

# Initialize with AutoDeploy backend
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    backend="_autodeploy"
)

# Generate text
output = llm.generate("The future of AI is")
print(output.outputs[0].text)

Running the Demo Script

The general entry point for running AutoDeploy demos is the build_and_run_ad.py script:
cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
Checkpoints are loaded directly from Hugging Face or a local HF-like directory.

torch.export Usage

AutoDeploy uses PyTorch’s torch.export API to capture the model’s computation graph. This process:
  • Generates a standard Torch graph with core PyTorch ATen operations
  • Includes custom attention operations based on the attention backend specified in configuration
  • Creates an exportable representation that can undergo automated transformations
The attention backend is determined by the configuration, allowing flexibility in how attention operations are implemented in the exported graph.

Configuration

AutoDeploy can be configured through the standard LLM API configuration options. The backend automatically determines:
  • Attention backend for the exported graph
  • Parallelism strategies (tensor/pipeline parallelism)
  • Quantization settings
  • KV-cache architecture
See expert configurations for advanced tuning options.
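As a hedged sketch of how such options are passed, configuration happens at LLM construction time. `tensor_parallel_size` is a standard LLM API argument; treat any other knobs as assumptions to verify against the expert configuration documentation for your TensorRT-LLM version. This fragment requires GPUs and model weights to actually run:

```python
from tensorrt_llm import LLM

# Configuration sketch (assumed parameter names; verify against
# the expert configuration docs for your installed version).
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    backend="_autodeploy",
    tensor_parallel_size=2,  # shard the exported graph across 2 GPUs
)
```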

Roadmap

Upcoming Model Support

  • Vision-Language Models (VLMs)
  • Structured State Space Models (SSMs) and Linear Attention architectures

Planned Features

  • Low-Rank Adaptation (LoRA)
  • Speculative Decoding for accelerated generation
Track development progress on the GitHub Project Board.

Advanced Topics

For more advanced usage, see the expert configuration options and the demos under examples/auto_deploy.

Contributing

We welcome community contributions! See examples/auto_deploy/CONTRIBUTING.md for guidelines.
