Introduction to TensorRT-LLM
TensorRT-LLM is an open-source library for optimizing Large Language Model (LLM) inference on NVIDIA GPUs. It provides state-of-the-art optimizations and an easy-to-use Python API to define LLMs and perform inference efficiently.
What is TensorRT-LLM?
Built on PyTorch, TensorRT-LLM delivers exceptional performance for LLM inference through advanced optimizations, including:
- Custom attention kernels for accelerated attention mechanisms
- In-flight batching to maximize GPU utilization
- Paged KV caching for efficient memory management
- Advanced quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant)
- Speculative decoding for faster token generation
- Multi-GPU and multi-node support with tensor and pipeline parallelism
Quantization
Support for FP8, FP4, INT4, and INT8 quantization to reduce memory footprint and increase throughput while maintaining accuracy
Speculative Decoding
Generate multiple tokens per iteration using draft models, N-gram matching, or Medusa/Eagle techniques for up to 3x speedup
Paged KV Cache
Efficient memory management with paged attention inspired by virtual memory, enabling longer sequences and higher batch sizes
In-Flight Batching
Dynamic batching that continuously processes new requests as they arrive, maximizing GPU utilization
Multi-GPU Support
Scale inference across multiple GPUs and nodes with tensor parallelism, pipeline parallelism, and expert parallelism for MoE models
OpenAI Compatible
Drop-in replacement for OpenAI API with trtllm-serve, supporting /v1/chat/completions and /v1/completions endpoints
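The in-flight batching idea above can be sketched with a toy scheduler: finished sequences leave the batch and queued requests join at every decode step, so slots never sit idle. This is a hypothetical simulation of the scheduling policy, not the TensorRT-LLM executor API.

```python
# Toy simulation of in-flight (continuous) batching: instead of waiting for a
# whole batch to finish, finished sequences leave the batch and queued requests
# join at every step, keeping the batch full. Hypothetical scheduler, not the
# real TensorRT-LLM executor.
from collections import deque

def run_inflight_batching(requests, max_batch=2):
    """requests: list of (request_id, tokens_to_generate).
    Returns a log of which requests were active at each decode step."""
    queue = deque(requests)
    active = {}          # request_id -> tokens remaining
    timeline = []
    while queue or active:
        # Admit new requests as soon as slots free up -- the key idea.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        timeline.append(sorted(active))
        for rid in list(active):   # one decode step for every active request
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]    # finished: its slot is reusable next step
    return timeline

log = run_inflight_batching([("A", 3), ("B", 1), ("C", 2)])
print(log)  # -> [['A', 'B'], ['A', 'C'], ['A', 'C']]
```

With static batching, request C would wait until both A and B finished (5 steps total); continuous admission finishes all three in 3 steps.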
Why TensorRT-LLM?
Performance
TensorRT-LLM delivers industry-leading performance:
- 40,000+ tokens/second with Llama 4 on NVIDIA B200 GPUs
- 2-4x throughput improvement over standard PyTorch implementations
- Up to 3.6x speedup with speculative decoding techniques
- Low time-to-first-token latency with optimized configurations
Flexibility
TensorRT-LLM is designed to be modular and easy to customize:
- PyTorch-native architecture allows developers to experiment and extend functionality
- High-level LLM API abstracts complexity while providing fine-grained control
- Pre-defined models with native PyTorch code that can be easily adapted
- Multiple backends including PyTorch, AutoDeploy (beta), and TensorRT
Production-Ready
- OpenAI-compatible API via trtllm-serve for seamless integration
- Triton Inference Server integration for enterprise deployments
- Docker containers on NGC for easy deployment
- Pre-quantized models on Hugging Face ready for immediate use
Key Features
Advanced Quantization
TensorRT-LLM supports multiple quantization formats:
- FP8: 8-bit floating point for Hopper GPUs (H100, H200)
- FP4: 4-bit floating point for Blackwell GPUs (B200)
- INT4 AWQ: Activation-aware Weight Quantization
- INT8 SmoothQuant: Smooth activation quantization
- Mixed precision: Different quantization per layer or component
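To make the tradeoff concrete, here is a toy sketch of symmetric per-tensor INT8 quantization: one scale maps floats to 8-bit codes, cutting storage 4x versus FP32 at the cost of a small, bounded rounding error. This illustrates the general idea only; it is not TensorRT-LLM's calibration pipeline or kernels.

```python
# Toy symmetric per-tensor INT8 quantization -- a sketch of the general idea,
# not TensorRT-LLM's actual quantization implementation.

def quantize_int8(values):
    """Map floats to int8 codes using a single per-tensor scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

weights = [0.02, -1.5, 0.7, 3.2, -0.4]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# int8 storage is 4x smaller than fp32; the round trip is close but lossy,
# with per-element error bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)                 # -> [1, -60, 28, 127, -16]
print(max_err <= scale)  # -> True
```

Formats like AWQ and SmoothQuant refine this basic recipe by choosing scales that account for activation statistics, which is what preserves accuracy at 4- and 8-bit widths.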
Speculative Decoding
Generate tokens faster with various speculative decoding techniques:
- Draft-target: Use a smaller draft model to predict multiple tokens
- N-gram matching: Reuse previous generations for common patterns
- Eagle3: Advanced speculative decoding with confidence scores
- MTP (Multi-Token Prediction): built-in multi-token prediction heads for supported models
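The draft-target scheme can be sketched in a few lines: a cheap draft model proposes several tokens, the target model verifies them in one pass, and the longest agreeing prefix is accepted, plus the target's correction as a bonus token. The models below are stand-ins and this is the simple greedy-matching variant, not TensorRT-LLM's API or the full rejection-sampling algorithm.

```python
# Toy sketch of greedy draft-target speculative decoding. Both "models" are
# hypothetical stand-ins that operate on integer token sequences.

def draft_model(prefix, k):
    # Stand-in draft: guesses that the sequence keeps counting up.
    return [prefix[-1] + i + 1 for i in range(k)]

def target_model(prefix):
    # Stand-in target: also counts up, but resets to 0 after 5 -- so the
    # draft is right most of the time and wrong occasionally.
    nxt = prefix[-1] + 1
    return nxt if nxt <= 5 else 0

def speculative_step(prefix, k=4):
    """One draft/verify iteration: accept the agreeing prefix; on the first
    disagreement, emit the target's token instead and stop."""
    proposed = draft_model(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_model(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)  # target's correction, a "free" token
            break
    return accepted

print(speculative_step([1]))  # -> [2, 3, 4, 5]: all 4 draft tokens accepted
print(speculative_step([4]))  # -> [5, 0]: draft wrong after 5, corrected
```

The speedup comes from the verify pass: checking k draft tokens costs roughly one target forward pass, so each iteration can emit several tokens instead of one.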
Disaggregated Serving
Separate prefill (context encoding) and decode (token generation) across different GPUs:
- Independent scaling: Scale prefill and decode separately based on workload
- Resource optimization: Use different GPU types for different stages
- KV cache transfer: Exchange KV cache via NCCL, UCX, or MPI
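The split can be pictured with a minimal handoff sketch: one worker runs prefill and produces the KV cache, a different worker consumes that cache to decode. In a real deployment the cache crosses GPUs over NCCL, UCX, or MPI; in this hypothetical toy it is just a Python dict, and the "model" merely counts upward.

```python
# Minimal sketch of the disaggregated-serving idea: prefill and decode run on
# separate workers and exchange a KV cache. Toy data structures only -- not
# TensorRT-LLM internals.

def prefill_worker(prompt_tokens):
    """Encode the full prompt once and return a (toy) KV cache."""
    return {"keys": list(prompt_tokens),
            "values": [t * 2 for t in prompt_tokens]}

def decode_worker(kv_cache, steps):
    """Generate tokens one at a time, extending the received cache."""
    out = []
    for _ in range(steps):
        nxt = kv_cache["keys"][-1] + 1          # toy "model": count upward
        kv_cache["keys"].append(nxt)
        kv_cache["values"].append(nxt * 2)
        out.append(nxt)
    return out

cache = prefill_worker([10, 11, 12])    # runs on the "prefill" GPU pool
tokens = decode_worker(cache, steps=3)  # runs on the "decode" GPU pool
print(tokens)  # -> [13, 14, 15]
```

Because the two stages have different bottlenecks (prefill is compute-bound, decode is memory-bandwidth-bound), separating them lets each pool be sized and provisioned independently.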
Long Context Support
Handle extremely long sequences efficiently:
- Skip Softmax Attention: Optimize attention for long contexts
- Paged KV cache: Efficient memory management for long sequences
- Streaming KV cache: Process sequences longer than GPU memory
- Context chunking: Automatic chunking for multi-million token contexts
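The paged KV cache bookkeeping can be sketched in the spirit of paged attention: logical token positions map through a per-sequence block table to fixed-size pages drawn from a shared pool, so no sequence needs one contiguous buffer. This is toy illustrative code, not TensorRT-LLM's memory manager.

```python
# Sketch of paged KV-cache bookkeeping: a block table translates logical token
# positions into (physical page, offset) pairs, like virtual memory.

PAGE_SIZE = 4  # tokens per page (real systems use larger blocks)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        """Ensure a physical page exists for logical position `pos`."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // PAGE_SIZE >= len(table):      # crossed into a new page?
            table.append(self.free_pages.pop(0))

    def locate(self, seq_id, pos):
        """Translate a logical position to (physical page, offset)."""
        page = self.block_tables[seq_id][pos // PAGE_SIZE]
        return page, pos % PAGE_SIZE

cache = PagedKVCache(num_pages=8)
for pos in range(6):          # sequence 0 writes 6 tokens -> needs 2 pages
    cache.append_token(0, pos)
for pos in range(3):          # sequence 1 writes 3 tokens -> 1 page
    cache.append_token(1, pos)

print(cache.block_tables)     # -> {0: [0, 1], 1: [2]}
print(cache.locate(0, 5))     # -> (1, 1): token 5 is at offset 1 of page 1
```

Since pages are allocated on demand and returned to the pool when a sequence finishes, memory scales with tokens actually stored rather than with the maximum sequence length, which is what enables longer contexts and larger batches.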
Supported Models
TensorRT-LLM supports a wide range of popular LLM architectures:
- Llama family: Llama 3, 3.1, 3.2, 3.3, Llama 4
- Qwen family: Qwen 2, 2.5, Qwen3, QwenVL
- DeepSeek: DeepSeek V2, V3, DeepSeek-R1
- Mixtral: Mixtral 8x7B, 8x22B
- Phi: Phi-3, Phi-4
- GPT models: GPT-2, GPT-J, GPT-NeoX
- Multimodal models: LLaVA, Qwen-VL, Mllama
- And many more…
Architecture
TensorRT-LLM provides three execution backends:
- PyTorch Backend (Default): Native PyTorch execution with TensorRT-LLM optimizations
- AutoDeploy (Beta): Automatic graph optimization using torch.export
- TensorRT Backend (Legacy): Pure TensorRT engine-based execution
The PyTorch backend is the recommended default for most use cases, offering the best balance of performance, flexibility, and ease of use.
Getting Started
Ready to start using TensorRT-LLM? Here’s where to go next:
Quickstart
Get up and running with TensorRT-LLM in minutes using Docker and simple Python examples
Installation
Detailed installation instructions for pip, Docker, and building from source
Examples
Explore comprehensive examples for different use cases and advanced features
API Reference
Complete API documentation for the LLM class and all configuration options
Community and Support
- GitHub: NVIDIA/TensorRT-LLM
- Documentation: nvidia.github.io/TensorRT-LLM
- NGC Containers: NGC Catalog
- Tech Blogs: Latest research and optimizations
Benchmarks and Performance
TensorRT-LLM consistently delivers state-of-the-art performance:
- Llama 4 Maverick: 1,000+ TPS/user on Blackwell B200
- DeepSeek-R1: World-record inference performance on B200 GPUs
- Llama 3.3 70B: 3x throughput with speculative decoding
- GPT-OSS-120B: High-performance serving with day-0 support