Installation
TensorRT-LLM can be installed in several ways depending on your needs. This guide covers all installation methods with system requirements and troubleshooting tips.
System Requirements
Before installing TensorRT-LLM, ensure your system meets these requirements:
Hardware Requirements
NVIDIA GPU with compute capability 7.0 or higher:
Volta (V100) - Compute capability 7.0
Turing (T4) - Compute capability 7.5
Ampere (A10, A100) - Compute capability 8.0, 8.6
Hopper (H100, H200) - Compute capability 9.0
Blackwell (B200) - Compute capability 10.0
8GB+ GPU memory minimum (16GB+ recommended for larger models)
50GB+ free disk space for Docker images and model caching
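The compute-capability list above can be turned into a quick pre-install check. A minimal sketch (the mapping and helper name are illustrative, not part of TensorRT-LLM):

```python
# Minimum compute capability required by TensorRT-LLM (see the list above)
MIN_CAPABILITY = (7, 0)

# Illustrative mapping for the architectures listed above
CAPABILITIES = {
    "V100": (7, 0), "T4": (7, 5), "A10": (8, 6), "A100": (8, 0),
    "H100": (9, 0), "H200": (9, 0), "B200": (10, 0),
}

def is_supported(gpu_name: str) -> bool:
    """True if the GPU meets the 7.0 minimum; unknown GPUs report False."""
    return CAPABILITIES.get(gpu_name, (0, 0)) >= MIN_CAPABILITY
```

On a live system the actual capability comes from `torch.cuda.get_device_capability(0)`.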
Software Requirements
Operating System: Ubuntu 22.04 or 24.04 LTS (recommended)
NVIDIA Driver: Version 535.x or newer
CUDA: Version 13.1 (automatically included in Docker containers)
Python: 3.10 or 3.12
PyTorch: 2.9.1 with CUDA 13.0 support
TensorRT-LLM is built against specific CUDA and PyTorch versions; using mismatched versions may cause runtime errors.
Installation Methods
Docker (Recommended)
Pip Install
Build from Source
Docker Installation
Docker is the recommended installation method, as it includes all dependencies pre-configured and tested.
Install NVIDIA Container Toolkit
First, install Docker and the NVIDIA Container Toolkit to enable GPU access in containers:
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Verify GPU access:
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi
Pull the TensorRT-LLM container
Download the latest release container from NGC:
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6
Available container types:
Release : Ready-to-use with TensorRT-LLM pre-installed
Devel : For development with build tools included
Launch the container
Start a container with GPU access and port forwarding:
docker run --rm -it \
  --ipc host \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6
Explanation of flags:
--ipc host: Required to prevent bus errors
--gpus all: Enable all GPUs
--ulimit memlock=-1: Unlimited locked memory
--ulimit stack=67108864: Increased stack size
-p 8000:8000: Expose port for trtllm-serve
-v $(pwd):/workspace: Mount current directory
Verify installation
Inside the container, verify TensorRT-LLM is installed:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
You should see the version number (e.g., 1.3.0rc6).
Installing via pip
Install TensorRT-LLM as a Python package using pip on Ubuntu 24.04.
Install CUDA Toolkit 13.1
Install CUDA following the official CUDA Installation Guide:
# Download and install CUDA 13.1
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-1
# Set CUDA_HOME environment variable
export CUDA_HOME=/usr/local/cuda-13.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
Add the exports to your ~/.bashrc for persistence.
Install PyTorch with CUDA 13.0
Install PyTorch 2.9.1 with CUDA 13.0 support:
pip3 install torch==2.9.1 torchvision --index-url https://download.pytorch.org/whl/cu130
The default PyTorch package uses CUDA 12.8, which is incompatible with TensorRT-LLM. You must install the CUDA 13.0 version.
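A quick way to confirm which CUDA build of PyTorch is active, per the note above. A hedged sketch (the helper name is mine; pass it the value of torch.version.cuda):

```python
def is_cuda13_build(torch_cuda_version):
    """Return True for a CUDA 13.x PyTorch build (e.g. '13.0').

    CPU-only builds report None for torch.version.cuda, which also fails the check.
    """
    if not torch_cuda_version:
        return False
    return torch_cuda_version.split(".")[0] == "13"
```

After installing, `python3 -c "import torch; print(torch.version.cuda)"` should print a 13.x version, not 12.8.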
Install system dependencies
Install required system packages:
sudo apt-get update
sudo apt-get -y install libopenmpi-dev
# Optional: Only required for disaggregated serving
sudo apt-get -y install libzmq3-dev
Install TensorRT-LLM
Install the TensorRT-LLM Python package:
pip3 install --ignore-installed pip setuptools wheel
pip3 install tensorrt_llm
The installation may take several minutes as it downloads and installs dependencies including TensorRT and NCCL.
Verify installation
Test the installation with a simple script:
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
Known Issues with Pip Installation
PyTorch version conflicts
On some systems, pip may replace your CUDA 13.0 PyTorch with the CUDA 12.8 version during installation.
Solution: Create a constraints file to pin the PyTorch version:
CURRENT_TORCH_VERSION=$(python3 -c "import torch; print(torch.__version__)")
echo "torch==$CURRENT_TORCH_VERSION" > /tmp/torch-constraint.txt
pip3 install tensorrt_llm -c /tmp/torch-constraint.txt
MPI in Slurm environments
If running in a Slurm-managed cluster, you may need to reconfigure MPI:
# Check with your cluster administrator for Slurm-compatible MPI
# This is not TensorRT-LLM specific, but a general MPI+Slurm issue
You may need to rebuild OpenMPI with the --with-slurm flag.
Building from Source
Build TensorRT-LLM from source for maximum performance, debugging, or custom CXX11 ABI configurations.
Clone the repository
Clone TensorRT-LLM with all submodules:
# Install git-lfs first
sudo apt-get update && sudo apt-get -y install git git-lfs
git lfs install
# Clone repository
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
The repository is approximately 2GB and requires git-lfs for large files.
Build development container
Create a Docker development container:
# Build the development image (~63GB disk space required)
make -C docker build
# Run the development container
make -C docker run
Or without GNU make:
docker build --pull \
--target devel \
--file docker/Dockerfile.multi \
--tag tensorrt_llm/devel:latest \
.
docker run --rm -it \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all \
--volume ${PWD}:/code/tensorrt_llm \
--workdir /code/tensorrt_llm \
tensorrt_llm/devel:latest
You can also pull a pre-built development container from NGC instead of building it.
Build TensorRT-LLM
Inside the development container, build TensorRT-LLM.
Option A: Full build with C++ compilation
# Build C++ runtime and Python bindings
python3 scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
# Install the wheel
pip3 install build/tensorrt_llm*.whl
Option B: Quick Python-only build
# Install in development mode (no C++ compilation)
pip3 install -e .
The full build takes 30-60 minutes depending on your CPU. The Python-only build is much faster but doesn’t include C++ optimizations.
Build with specific GPU architectures (optional)
To reduce build time, compile only for your GPU architecture:
# Ada and Hopper only (RTX 40xx, H100)
python3 scripts/build_wheel.py --cuda_architectures "89-real;90-real"
# Ampere only (A100, A10)
python3 scripts/build_wheel.py --cuda_architectures "80-real;86-real"
Available architectures:
70: Volta (V100)
75: Turing (T4)
80,86: Ampere (A100, A10)
89: Ada (RTX 40xx)
90: Hopper (H100, H200)
100: Blackwell (B200)
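The table above maps directly onto the --cuda_architectures flag. A small illustrative helper (the names are mine, not part of build_wheel.py):

```python
# Compute capability -> --cuda_architectures token, per the table above
ARCH_FLAGS = {
    (7, 0): "70-real", (7, 5): "75-real",
    (8, 0): "80-real", (8, 6): "86-real",
    (8, 9): "89-real", (9, 0): "90-real",
    (10, 0): "100-real",
}

def cuda_architectures(*capabilities):
    """Build one semicolon-joined flag value, e.g. for a mixed A100 + A10 node."""
    return ";".join(ARCH_FLAGS[c] for c in capabilities)
```

For example, `cuda_architectures((8, 0), (8, 6))` yields the "80-real;86-real" string used in the Ampere example above.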
Verify the build
Test the built package:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
python3 -c "from tensorrt_llm import LLM; print('LLM API available')"
Build Options
The build_wheel.py script supports many options:
# Clean build
python3 scripts/build_wheel.py --clean
# Enable debug symbols
python3 scripts/build_wheel.py --build_type=Debug
# Parallel build with N jobs
python3 scripts/build_wheel.py -j 8
# Specify TensorRT root
python3 scripts/build_wheel.py --trt_root /path/to/tensorrt
# Use old CXX11 ABI
python3 scripts/build_wheel.py --use_old_abi
Verifying Your Installation
After installation, run this comprehensive test:
import sys
import torch
from tensorrt_llm import LLM, SamplingParams

print("=" * 50)
print("TensorRT-LLM Installation Check")
print("=" * 50)

# Check Python version
print(f"Python version: {sys.version}")

# Check PyTorch and CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Check TensorRT-LLM
import tensorrt_llm
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")

# Quick inference test
print("\nRunning quick inference test...")
try:
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=10))
    print("✓ Inference successful!")
    print(f"  Generated: {outputs[0].outputs[0].text}")
except Exception as e:
    print(f"✗ Inference failed: {e}")

print("\n" + "=" * 50)
Environment Configuration
Recommended Environment Variables
# CUDA paths
export CUDA_HOME=/usr/local/cuda-13.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# TensorRT-LLM settings
export TRTLLM_LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR
# Model cache directory (default: ~/.cache/huggingface)
export HF_HOME=/path/to/model/cache
# HuggingFace token for private models
export HF_TOKEN=your_hf_token_here
# Multi-GPU settings
export NCCL_DEBUG=INFO  # For debugging distributed runs
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Specify GPUs to use
# Enable TensorFloat-32 on Ampere+ GPUs
export NVIDIA_TF32_OVERRIDE=1
# Disable cudnn benchmarking (faster startup)
export CUDNN_BENCHMARK=0
# Keep NCCL shared-memory transport enabled for large models (0 = enabled)
export NCCL_SHM_DISABLE=0
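The same settings can also be applied from Python before the library is imported. A sketch (variable names are the ones listed above; the chosen defaults are illustrative):

```python
import os

# Must be set before importing tensorrt_llm / torch for the values to take effect;
# setdefault leaves any value already exported in the shell untouched.
os.environ.setdefault("TRTLLM_LOG_LEVEL", "INFO")
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")

print(os.environ["TRTLLM_LOG_LEVEL"])
```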
Upgrading TensorRT-LLM
Pull the latest container:
docker pull nvcr.io/nvidia/tensorrt-llm/release:latest
Upgrade via pip:
pip3 install --upgrade tensorrt_llm
Pull latest changes and rebuild:
cd TensorRT-LLM
git pull
git submodule update --recursive
git lfs pull
# Rebuild
python3 scripts/build_wheel.py --clean
pip3 install --force-reinstall build/tensorrt_llm*.whl
Troubleshooting
CUDA out of memory errors
Symptoms: RuntimeError: CUDA out of memory
Solutions:
Use a smaller model or quantized version
Reduce batch size or sequence length
Enable KV cache offloading
Use tensor parallelism across multiple GPUs
Check for memory leaks (restart Python kernel)
from tensorrt_llm import LLM, KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.6),
)
Import errors
Symptoms: ModuleNotFoundError or ImportError
Solutions:
Verify PyTorch CUDA version matches TensorRT-LLM:
python3 -c "import torch; print(torch.version.cuda)"
Check for conflicting installations:
Reinstall with correct PyTorch:
pip3 uninstall tensorrt_llm torch
pip3 install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu130
pip3 install tensorrt_llm
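For the "check for conflicting installations" step above, one illustrative way using only the standard library is to list every installed distribution whose name mentions tensorrt or torch:

```python
# Spot duplicate or mismatched installs (stdlib only; illustrative, not a TensorRT-LLM tool)
from importlib import metadata

suspects = sorted(
    dist.metadata["Name"]
    for dist in metadata.distributions()
    if any(k in (dist.metadata["Name"] or "").lower() for k in ("tensorrt", "torch"))
)
print(suspects)
```

Seeing both a cu130 and a cu128 torch wheel here, or multiple tensorrt packages, usually explains the import error.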
Driver/CUDA version mismatch
Symptoms: CUDA driver version is insufficient for CUDA runtime version
Solutions:
Update NVIDIA driver:
ubuntu-drivers devices
sudo apt install nvidia-driver-535
Or use CUDA compatibility package:
sudo apt-get install cuda-compat-13-1
Verify driver version:
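For the verification step above, a hedged sketch that shells out to nvidia-smi (the query flags are standard nvidia-smi options; the small parser is illustrative):

```python
import shutil
import subprocess

def parse_driver_version(csv_output: str) -> str:
    """Extract the version from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`."""
    return csv_output.strip().splitlines()[0].strip()

if shutil.which("nvidia-smi"):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("Driver version:", parse_driver_version(result.stdout))
else:
    print("nvidia-smi not found - is the NVIDIA driver installed?")
```

The printed version should be 535.x or newer, per the software requirements above.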
Compilation errors when building from source
Symptoms: Build failures with GCC/C++ errors
Solutions:
Ensure you’re using the development container
Update submodules:
git submodule update --init --recursive
Clean build directory:
rm -rf build/
python3 scripts/build_wheel.py --clean
Check disk space (requires 63GB+)
Next Steps
Now that TensorRT-LLM is installed, you can:
Try the Quickstart Run your first inference in minutes with simple examples
Explore Examples Learn advanced features like speculative decoding and multi-GPU inference
Deploy Models Production deployment guides for popular models
Optimize Performance Benchmark and tune for maximum throughput
Getting Help
If you encounter issues, include your GPU model, driver version, TensorRT-LLM version, and a minimal reproducible example when reporting them.