
Installation

TensorRT-LLM can be installed in several ways depending on your needs. This guide covers all installation methods with system requirements and troubleshooting tips.

System Requirements

Before installing TensorRT-LLM, ensure your system meets these requirements:

Hardware Requirements

  • NVIDIA GPU with compute capability 7.0 or higher:
    • Volta (V100) - Compute capability 7.0
    • Turing (T4) - Compute capability 7.5
    • Ampere (A10, A100) - Compute capability 8.0, 8.6
    • Hopper (H100, H200) - Compute capability 9.0
    • Blackwell (B200) - Compute capability 10.0
  • 8GB+ GPU memory minimum (16GB+ recommended for larger models)
  • 50GB+ free disk space for Docker images and model caching
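As a rough sanity check on the memory requirement above, a model's weight footprint is approximately parameter count times bytes per parameter. The helper below is illustrative only, not part of TensorRT-LLM:

```python
# Rough estimate of GPU memory needed just for model weights.
# FP16/BF16 weights take 2 bytes per parameter; INT8 takes 1, INT4 about 0.5.
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    # params_billions × 1e9 params × bytes_per_param, converted back to GB (÷1e9)
    return params_billions * bytes_per_param

print(f"{weight_memory_gb(1.1):.1f} GB")  # TinyLlama-1.1B in FP16
print(f"{weight_memory_gb(7):.0f} GB")    # a 7B-parameter model in FP16
```

Note that the KV cache and activations need headroom on top of the weights, which is why 16GB+ is recommended for larger models.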

Software Requirements

  • Operating System: Ubuntu 22.04 or 24.04 LTS (recommended)
  • NVIDIA Driver: Version 535.x or newer
  • CUDA: Version 13.1 (automatically included in Docker containers)
  • Python: 3.10 or 3.12
  • PyTorch: 2.9.1 with CUDA 13.0 support
TensorRT-LLM is built against specific CUDA and PyTorch versions; mismatched versions may cause runtime errors.
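A tiny preflight check against the Python versions listed above can catch a mismatch before a long install. The helper name here is ours, for illustration:

```python
import sys

# Interpreter versions listed in the software requirements above.
SUPPORTED_PYTHONS = {(3, 10), (3, 12)}

def python_supported(version_info=sys.version_info) -> bool:
    """True if this interpreter's major.minor matches a supported release."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHONS

print(python_supported((3, 10, 6)))  # True
print(python_supported((3, 11, 9)))  # False
```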

Installation Methods

The two most common routes are the official NGC Docker container (nvcr.io/nvidia/tensorrt-llm/release) and the pip wheel (pip3 install tensorrt_llm); the commands for both appear later in this guide.

Verifying Your Installation

After installation, run this comprehensive test:
test_installation.py
import sys
import torch
from tensorrt_llm import LLM, SamplingParams

print("=" * 50)
print("TensorRT-LLM Installation Check")
print("=" * 50)

# Check Python version
print(f"Python version: {sys.version}")

# Check PyTorch and CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Check TensorRT-LLM
import tensorrt_llm
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")

# Quick inference test
print("\nRunning quick inference test...")
try:
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=10))
    print("✓ Inference successful!")
    print(f"  Generated: {outputs[0].outputs[0].text}")
except Exception as e:
    print(f"✗ Inference failed: {e}")

print("\n" + "=" * 50)

Environment Configuration

# CUDA paths
export CUDA_HOME=/usr/local/cuda-13.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# TensorRT-LLM settings
export TRTLLM_LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR

# Model cache directory (default: ~/.cache/huggingface)
export HF_HOME=/path/to/model/cache

# HuggingFace token for private models
export HF_TOKEN=your_hf_token_here

# Multi-GPU settings
export NCCL_DEBUG=INFO  # For debugging distributed training
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Specify GPUs to use
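
The same settings can be applied from Python, as long as they are set before any CUDA-using library is imported in that process; a short sketch using the variables shown above:

```python
import os

# CUDA reads these only at initialization, so set them before importing
# torch or tensorrt_llm in the same process.
os.environ.setdefault("TRTLLM_LOG_LEVEL", "INFO")
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # the process will see 2 GPUs, renumbered as cuda:0 and cuda:1
```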

Performance Tuning

# Enable TensorFloat-32 on Ampere+ GPUs
export NVIDIA_TF32_OVERRIDE=1

# Disable cudnn benchmarking (faster startup)
export CUDNN_BENCHMARK=0

# Keep NCCL shared-memory transport enabled (0 = enabled, the default)
export NCCL_SHM_DISABLE=0

Upgrading TensorRT-LLM

Pull the latest container:
docker pull nvcr.io/nvidia/tensorrt-llm/release:latest

Troubleshooting

Symptoms: RuntimeError: CUDA out of memory

Solutions:
  1. Use a smaller model or quantized version
  2. Reduce batch size or sequence length
  3. Enable KV cache offloading
  4. Use tensor parallelism across multiple GPUs
  5. Check for memory leaks (restart Python kernel)
from tensorrt_llm import LLM, KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.6)
)
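
For intuition, free_gpu_memory_fraction bounds the KV cache to a fraction of the GPU memory left free after the weights are loaded. The arithmetic below is a back-of-envelope illustration, not the allocator's exact logic:

```python
# Back-of-envelope KV-cache budget under free_gpu_memory_fraction.
# (Illustrative only; the real allocator measures free memory at runtime.)
def kv_cache_budget_gb(total_gb: float, weights_gb: float, fraction: float) -> float:
    return (total_gb - weights_gb) * fraction

# 24 GB GPU, ~2.2 GB of FP16 weights (TinyLlama-1.1B), fraction 0.6:
print(f"{kv_cache_budget_gb(24.0, 2.2, 0.6):.1f} GB")
```

Lowering the fraction leaves more headroom for activations and other allocations at the cost of shorter sequences or fewer concurrent requests in the cache.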
Symptoms: ModuleNotFoundError or ImportError

Solutions:
  1. Verify PyTorch CUDA version matches TensorRT-LLM:
    python3 -c "import torch; print(torch.version.cuda)"
    
  2. Check for conflicting installations:
    pip list | grep tensorrt
    
  3. Reinstall with correct PyTorch:
    pip3 uninstall tensorrt_llm torch
    pip3 install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu130
    pip3 install tensorrt_llm
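When matching wheels by hand, the PyTorch index URL encodes the CUDA build as a cuXYZ tag (e.g. cu130 for CUDA 13.0). A small helper to derive it, illustrative rather than an official tool:

```python
def cuda_wheel_tag(cuda_version: str) -> str:
    """Map a CUDA version like '13.0' to a PyTorch index tag like 'cu130'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"

print(cuda_wheel_tag("13.0"))  # cu130
print(cuda_wheel_tag("12.4"))  # cu124
```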
    
Symptoms: CUDA driver version is insufficient for CUDA runtime version

Solutions:
  1. Update NVIDIA driver:
    ubuntu-drivers devices
    sudo apt install nvidia-driver-535
    
  2. Or use CUDA compatibility package:
    sudo apt-get install cuda-compat-13-1
    
  3. Verify driver version:
    nvidia-smi
    
Symptoms: Build failures with GCC/C++ errors

Solutions:
  1. Ensure you’re using the development container
  2. Update submodules:
    git submodule update --init --recursive
    
  3. Clean build directory:
    rm -rf build/
    python3 scripts/build_wheel.py --clean
    
  4. Check disk space (requires 63GB+)

Next Steps

Now that TensorRT-LLM is installed, you can:

  • Try the Quickstart: run your first inference in minutes with simple examples
  • Explore Examples: learn advanced features like speculative decoding and multi-GPU inference
  • Deploy Models: production deployment guides for popular models
  • Optimize Performance: benchmark and tune for maximum throughput

Getting Help

If you encounter issues, include your GPU model, driver version, TensorRT-LLM version, and a minimal reproducible example in your report.
