Overview
AutoDeploy is a beta backend designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models from Hugging Face Transformers, to TensorRT-LLM. It provides an alternative deployment method using the LLM API without requiring code changes to the source model or manual implementation of inference optimizations.

How It Works
AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. Instead of manually implementing optimizations like KV-caches, multi-GPU parallelism, or quantization, AutoDeploy generates an inference-optimized graph that executes directly in the TensorRT-LLM PyTorch runtime.

Architecture
The AutoDeploy workflow follows this pipeline:
- PyTorch Model: Start with a standard PyTorch or Hugging Face model
- torch.export: Export using torch.export to generate a standard Torch graph containing core PyTorch ATen operations alongside custom attention operations
- Graph Transformations: Apply automated transformations including:
  - Graph sharding
  - KV-cache insertion
  - GEMM fusion
  - MHA (Multi-Head Attention) fusion
  - CudaGraph optimization
- Compilation: Compile using supported compile backends such as torch-opt
- Runtime Deployment: Deploy via the TensorRT-LLM runtime with in-flight batching, paging, and overlap scheduling
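The transformation stage above can be pictured as an ordered sequence of graph passes. The following sketch is purely illustrative (the graph representation and function names are not AutoDeploy's real API), but it mirrors the pass ordering listed above:

```python
# Conceptual sketch of AutoDeploy's transformation pipeline.
# The graph is modeled as a plain dict; each pass marks the optimization
# it would perform on a real exported graph.

def shard(graph):
    """Split weights so the graph can run across multiple GPUs."""
    return {**graph, "sharded": True}

def insert_kv_cache(graph):
    """Replace attention ops with KV-cache-aware variants."""
    return {**graph, "kv_cache": True}

def fuse_gemm(graph):
    """Merge adjacent matmuls into single fused kernels."""
    return {**graph, "gemm_fused": True}

def fuse_mha(graph):
    """Collapse attention subgraphs into fused MHA ops."""
    return {**graph, "mha_fused": True}

PASSES = [shard, insert_kv_cache, fuse_gemm, fuse_mha]

def optimize(graph):
    # Passes run in order; each consumes the previous pass's output.
    for p in PASSES:
        graph = p(graph)
    return graph

optimized = optimize({"ops": ["linear", "attention", "linear"]})
```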
Key Features
Seamless Translation
Automatically converts PyTorch/Hugging Face models to TensorRT-LLM without manual rewrites
Unified Model Definition
Maintain a single source of truth with your original PyTorch/Hugging Face model
Optimized Inference
Built-in transformations for sharding, quantization, KV-cache integration, and fusion
Day-0 Support
Deploy newly released models immediately, with performance improvements delivered continuously
Installation
AutoDeploy is included with the TensorRT-LLM installation; no separate package is required.

Quick Start
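A typical installation looks like the following; the exact package index and flags may vary by release and platform, so consult the TensorRT-LLM installation docs for your environment:

```shell
# Installing TensorRT-LLM also installs the AutoDeploy backend.
pip3 install tensorrt_llm
```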
Using the LLM API
The AutoDeploy backend can be used by specifying backend="_autodeploy" in the LLM API.
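A minimal sketch of selecting the backend through the LLM API; the model identifier is a placeholder, and constructor details may differ across TensorRT-LLM releases:

```python
from tensorrt_llm import LLM

# backend="_autodeploy" routes the model through AutoDeploy's
# export-and-transform pipeline instead of the default backend.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder Hugging Face model id
    backend="_autodeploy",
)

outputs = llm.generate(["What does AutoDeploy do?"])
```

Running this requires a GPU environment with TensorRT-LLM installed.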
Running the Demo Script
The general entry point for running AutoDeploy demos is the build_and_run_ad.py script.
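A hypothetical invocation is shown below; the model id is a placeholder and the exact flags may differ, so check the script's --help output:

```shell
# Run an AutoDeploy demo from the examples directory.
cd examples/auto_deploy
python build_and_run_ad.py --model "meta-llama/Llama-3.1-8B-Instruct"
```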
torch.export Usage
AutoDeploy uses PyTorch’s torch.export API to capture the model’s computation graph. This process:
- Generates a standard Torch graph with core PyTorch ATen operations
- Includes custom attention operations based on the attention backend specified in configuration
- Creates an exportable representation that can undergo automated transformations
Configuration
AutoDeploy can be configured through the standard LLM API configuration options. The backend automatically determines:
- Attention backend for the exported graph
- Parallelism strategies (tensor/pipeline parallelism)
- Quantization settings
- KV-cache architecture
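Although these decisions are automatic, settings such as parallelism can still be supplied explicitly through the LLM API. A hedged sketch follows; parameter availability may vary by release, and the model id is a placeholder:

```python
from tensorrt_llm import LLM

# tensor_parallel_size=2 asks the sharding transform to split the
# exported graph across two GPUs; remaining settings are derived
# automatically by the backend.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    backend="_autodeploy",
    tensor_parallel_size=2,
)
```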
Roadmap
Upcoming Model Support
- Vision-Language Models (VLMs)
- Structured State Space Models (SSMs) and Linear Attention architectures
Planned Features
- Low-Rank Adaptation (LoRA)
- Speculative Decoding for accelerated generation
Advanced Topics
For more advanced usage, see:
- Example Run Script
- Logging Configuration
- Workflow Integration
- Performance Benchmarking
- KV Cache Architecture
- Export ONNX for EdgeLLM
Contributing
We welcome community contributions! See examples/auto_deploy/CONTRIBUTING.md for guidelines.