Installation
TensorRT-LLM can be installed in several ways depending on your needs. This guide covers all installation methods with system requirements and troubleshooting tips.
System Requirements
Before installing TensorRT-LLM, ensure your system meets these requirements:
Hardware Requirements
NVIDIA GPU with compute capability 7.0 or higher:
Volta (V100) - Compute capability 7.0
Turing (T4) - Compute capability 7.5
Ampere (A10, A100) - Compute capability 8.0, 8.6
Hopper (H100, H200) - Compute capability 9.0
Blackwell (B200) - Compute capability 10.0
8GB+ GPU memory minimum (16GB+ recommended for larger models)
50GB+ free disk space for Docker images and model caching
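The compute-capability list above can be turned into a quick pre-install check. A minimal sketch (the mapping and helper name are illustrative, not part of TensorRT-LLM):

```python
# Minimum compute capability required by TensorRT-LLM (see the list above)
MIN_CAPABILITY = (7, 0)

# Illustrative mapping for the architectures listed above
CAPABILITIES = {
    "V100": (7, 0), "T4": (7, 5), "A10": (8, 6), "A100": (8, 0),
    "H100": (9, 0), "H200": (9, 0), "B200": (10, 0),
}

def is_supported(gpu_name: str) -> bool:
    """True if the GPU meets the 7.0 minimum; unknown GPUs report False."""
    return CAPABILITIES.get(gpu_name, (0, 0)) >= MIN_CAPABILITY
```

On a live system the actual capability comes from `torch.cuda.get_device_capability(0)`.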
Software Requirements
Operating System: Ubuntu 22.04 or 24.04 LTS (recommended)
NVIDIA Driver: Version 535.x or newer
CUDA: Version 13.1 (automatically included in Docker containers)
Python: 3.10 or 3.12
PyTorch: 2.9.1 with CUDA 13.0 support
TensorRT-LLM is built against specific CUDA and PyTorch versions; using mismatched versions may cause runtime errors.
Installation Methods
Docker (Recommended)
Pip Install
Build from Source
Docker Installation
Docker is the recommended installation method, as it includes all dependencies pre-configured and tested.
Install NVIDIA Container Toolkit
First, install Docker and the NVIDIA Container Toolkit to enable GPU access in containers:
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Verify GPU access:
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi
Pull the TensorRT-LLM container
Download the latest release container from NGC:
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6
Available container types:
Release : Ready-to-use with TensorRT-LLM pre-installed
Devel : For development with build tools included
Launch the container
Start a container with GPU access and port forwarding:
docker run --rm -it \
  --ipc host \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6
Explanation of flags:
--ipc host: Required to prevent bus errors
--gpus all: Enable all GPUs
--ulimit memlock=-1: Unlimited locked memory
--ulimit stack=67108864: Increased stack size
-p 8000:8000: Expose port for trtllm-serve
-v $(pwd):/workspace: Mount current directory
Verify installation
Inside the container, verify TensorRT-LLM is installed:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
You should see the version number (e.g., 1.3.0rc6).
Installing via pip
Install TensorRT-LLM as a Python package using pip on Ubuntu 24.04.
Install CUDA Toolkit 13.1
Install CUDA following the official CUDA Installation Guide:
# Download and install CUDA 13.1
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-1
# Set CUDA_HOME environment variable
export CUDA_HOME=/usr/local/cuda-13.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
Add the exports to your ~/.bashrc for persistence.
Install PyTorch with CUDA 13.0
Install PyTorch 2.9.1 with CUDA 13.0 support:
pip3 install torch==2.9.1 torchvision --index-url https://download.pytorch.org/whl/cu130
The default PyTorch package uses CUDA 12.8, which is incompatible with TensorRT-LLM. You must install the CUDA 13.0 version.
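A quick way to confirm which CUDA build of PyTorch is active, per the note above. A hedged sketch (the helper name is mine; pass it the value of torch.version.cuda):

```python
def is_cuda13_build(torch_cuda_version):
    """Return True for a CUDA 13.x PyTorch build (e.g. '13.0').

    CPU-only builds report None for torch.version.cuda, which also fails the check.
    """
    if not torch_cuda_version:
        return False
    return torch_cuda_version.split(".")[0] == "13"
```

After installing, `python3 -c "import torch; print(torch.version.cuda)"` should print a 13.x version, not 12.8.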
Install system dependencies
Install required system packages:
sudo apt-get update
sudo apt-get -y install libopenmpi-dev
# Optional: Only required for disaggregated serving
sudo apt-get -y install libzmq3-dev
Install TensorRT-LLM
Install the TensorRT-LLM Python package:
pip3 install --ignore-installed pip setuptools wheel
pip3 install tensorrt_llm
The installation may take several minutes as it downloads and installs dependencies including TensorRT and NCCL.
Verify installation
Test the installation with a simple script:
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
Known Issues with Pip Installation
PyTorch version conflicts
On some systems, pip may replace your CUDA 13.0 PyTorch with the CUDA 12.8 version during installation.
Solution: Create a constraints file to pin the PyTorch version:
CURRENT_TORCH_VERSION=$(python3 -c "import torch; print(torch.__version__)")
echo "torch==$CURRENT_TORCH_VERSION" > /tmp/torch-constraint.txt
pip3 install tensorrt_llm -c /tmp/torch-constraint.txt
MPI in Slurm environments
If running in a Slurm-managed cluster, you may need to reconfigure MPI:
# Check with your cluster administrator for Slurm-compatible MPI
# This is not TensorRT-LLM specific, but a general MPI+Slurm issue
You may need to rebuild OpenMPI with the --with-slurm flag.
Building from Source
Build TensorRT-LLM from source for maximum performance, debugging, or custom CXX11 ABI configurations.
Clone the repository
Clone TensorRT-LLM with all submodules:
# Install git-lfs first
sudo apt-get update && sudo apt-get -y install git git-lfs
git lfs install
# Clone repository
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
The repository is approximately 2GB and requires git-lfs for large files.
Build development container
Create a Docker development container:
# Build the development image (~63GB disk space required)
make -C docker build
# Run the development container
make -C docker run
Or without GNU make:
docker build --pull \
--target devel \
--file docker/Dockerfile.multi \
--tag tensorrt_llm/devel:latest \
.
docker run --rm -it \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all \
--volume ${PWD}:/code/tensorrt_llm \
--workdir /code/tensorrt_llm \
tensorrt_llm/devel:latest
You can also pull a pre-built development container from NGC instead of building it.
Build TensorRT-LLM
Inside the development container, build TensorRT-LLM.
Option A: Full build with C++ compilation
# Build C++ runtime and Python bindings
python3 scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
# Install the wheel
pip3 install build/tensorrt_llm*.whl
Option B: Quick Python-only build
# Install in development mode (no C++ compilation)
pip3 install -e .
The full build takes 30-60 minutes depending on your CPU. The Python-only build is much faster but doesn’t include C++ optimizations.
Build with specific GPU architectures (optional)
To reduce build time, compile only for your GPU architecture:
# Ada and Hopper only (RTX 40xx, H100)
python3 scripts/build_wheel.py --cuda_architectures "89-real;90-real"
# Ampere only (A100, A10)
python3 scripts/build_wheel.py --cuda_architectures "80-real;86-real"
Available architectures:
70: Volta (V100)
75: Turing (T4)
80,86: Ampere (A100, A10)
89: Ada (RTX 40xx)
90: Hopper (H100, H200)
100: Blackwell (B200)
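The table above maps directly onto the --cuda_architectures flag. A small illustrative helper (the names are mine, not part of build_wheel.py):

```python
# Compute capability -> --cuda_architectures token, per the table above
ARCH_FLAGS = {
    (7, 0): "70-real", (7, 5): "75-real",
    (8, 0): "80-real", (8, 6): "86-real",
    (8, 9): "89-real", (9, 0): "90-real",
    (10, 0): "100-real",
}

def cuda_architectures(*capabilities):
    """Build one semicolon-joined flag value, e.g. for a mixed A100 + A10 node."""
    return ";".join(ARCH_FLAGS[c] for c in capabilities)
```

For example, `cuda_architectures((8, 0), (8, 6))` yields the "80-real;86-real" string used in the Ampere example above.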
Verify the build
Test the built package:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
python3 -c "from tensorrt_llm import LLM; print('LLM API available')"
Build Options
The build_wheel.py script supports many options:
# Clean build
python3 scripts/build_wheel.py --clean
# Enable debug symbols
python3 scripts/build_wheel.py --build_type=Debug
# Parallel build with N jobs
python3 scripts/build_wheel.py -j 8
# Specify TensorRT root
python3 scripts/build_wheel.py --trt_root /path/to/tensorrt
# Use old CXX11 ABI
python3 scripts/build_wheel.py --use_old_abi
Verifying Your Installation
After installation, run this comprehensive test:
import sys
import torch
from tensorrt_llm import LLM, SamplingParams

print("=" * 50)
print("TensorRT-LLM Installation Check")
print("=" * 50)

# Check Python version
print(f"Python version: {sys.version}")

# Check PyTorch and CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Check TensorRT-LLM
import tensorrt_llm
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")

# Quick inference test
print("\nRunning quick inference test...")
try:
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=10))
    print("✓ Inference successful!")
    print(f"  Generated: {outputs[0].outputs[0].text}")
except Exception as e:
    print(f"✗ Inference failed: {e}")

print("\n" + "=" * 50)
Environment Configuration
Recommended Environment Variables
# CUDA paths
export CUDA_HOME=/usr/local/cuda-13.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# TensorRT-LLM settings
export TRTLLM_LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR
# Model cache directory (default: ~/.cache/huggingface)
export HF_HOME=/path/to/model/cache
# HuggingFace token for private models
export HF_TOKEN=your_hf_token_here
# Multi-GPU settings
export NCCL_DEBUG=INFO  # For debugging distributed runs
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Specify GPUs to use
# Enable TensorFloat-32 on Ampere+ GPUs
export NVIDIA_TF32_OVERRIDE=1
# Disable cudnn benchmarking (faster startup)
export CUDNN_BENCHMARK=0
# Keep NCCL shared-memory transport enabled for large models (0 = enabled)
export NCCL_SHM_DISABLE=0
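The same settings can also be applied from Python before the library is imported. A sketch (variable names are the ones listed above; the chosen defaults are illustrative):

```python
import os

# Must be set before importing tensorrt_llm / torch for the values to take effect;
# setdefault leaves any value already exported in the shell untouched.
os.environ.setdefault("TRTLLM_LOG_LEVEL", "INFO")
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")

print(os.environ["TRTLLM_LOG_LEVEL"])
```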
Upgrading TensorRT-LLM
Pull the latest container:
docker pull nvcr.io/nvidia/tensorrt-llm/release:latest
Upgrade via pip:
pip3 install --upgrade tensorrt_llm
Pull latest changes and rebuild:
cd TensorRT-LLM
git pull
git submodule update --recursive
git lfs pull
# Rebuild
python3 scripts/build_wheel.py --clean
pip3 install --force-reinstall build/tensorrt_llm*.whl
Troubleshooting
CUDA out of memory errors
Symptoms: RuntimeError: CUDA out of memory
Solutions:
Use a smaller model or quantized version
Reduce batch size or sequence length
Enable KV cache offloading
Use tensor parallelism across multiple GPUs
Check for memory leaks (restart Python kernel)
from tensorrt_llm import LLM, KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.6),
)
Import errors
Symptoms: ModuleNotFoundError or ImportError
Solutions:
Verify PyTorch CUDA version matches TensorRT-LLM:
python3 -c "import torch; print(torch.version.cuda)"
Check for conflicting installations:
Reinstall with correct PyTorch:
pip3 uninstall tensorrt_llm torch
pip3 install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu130
pip3 install tensorrt_llm
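For the "check for conflicting installations" step above, one illustrative way using only the standard library is to list every installed distribution whose name mentions tensorrt or torch:

```python
# Spot duplicate or mismatched installs (stdlib only; illustrative, not a TensorRT-LLM tool)
from importlib import metadata

suspects = sorted(
    dist.metadata["Name"]
    for dist in metadata.distributions()
    if any(k in (dist.metadata["Name"] or "").lower() for k in ("tensorrt", "torch"))
)
print(suspects)
```

Seeing both a cu130 and a cu128 torch wheel here, or multiple tensorrt packages, usually explains the import error.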
Driver/CUDA version mismatch
Symptoms: CUDA driver version is insufficient for CUDA runtime version
Solutions:
Update NVIDIA driver:
ubuntu-drivers devices
sudo apt install nvidia-driver-535
Or use CUDA compatibility package:
sudo apt-get install cuda-compat-13-1
Verify driver version:
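For the verification step above, a hedged sketch that shells out to nvidia-smi (the query flags are standard nvidia-smi options; the small parser is illustrative):

```python
import shutil
import subprocess

def parse_driver_version(csv_output: str) -> str:
    """Extract the version from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`."""
    return csv_output.strip().splitlines()[0].strip()

if shutil.which("nvidia-smi"):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("Driver version:", parse_driver_version(result.stdout))
else:
    print("nvidia-smi not found - is the NVIDIA driver installed?")
```

The printed version should be 535.x or newer, per the software requirements above.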
Compilation errors when building from source
Symptoms: Build failures with GCC/C++ errors
Solutions:
Ensure you’re using the development container
Update submodules:
git submodule update --init --recursive
Clean build directory:
rm -rf build/
python3 scripts/build_wheel.py --clean
Check disk space (requires 63GB+)
Next Steps
Now that TensorRT-LLM is installed, you can:
Try the Quickstart Run your first inference in minutes with simple examples
Explore Examples Learn advanced features like speculative decoding and multi-GPU inference
Deploy Models Production deployment guides for popular models
Optimize Performance Benchmark and tune for maximum throughput
Getting Help
If you encounter issues, include your GPU model, driver version, TensorRT-LLM version, and a minimal reproducible example when reporting them.