Quickstart Guide
This guide will get you up and running with TensorRT-LLM in minutes. You'll learn how to deploy a model for online serving and run offline inference using the Python API.
Prerequisites
Before you begin, ensure you have:
- NVIDIA GPU with compute capability 7.0+ (Volta, Turing, Ampere, Hopper, or Blackwell)
- NVIDIA Driver version 535+ installed
- Docker with NVIDIA Container Toolkit installed
- 8GB+ GPU memory (for TinyLlama example; larger models require more)
If you don’t have Docker installed, follow the NVIDIA Container Toolkit installation guide.
Launch Docker Container
Pull and run the TensorRT-LLM container
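A typical launch command is sketched below; the image tag is an assumption, so check the NGC catalog for the current TensorRT-LLM release tag:

```shell
# Run the TensorRT-LLM container with all GPUs visible and port 8000 exposed.
# The ":latest" tag is an assumption -- check the NGC catalog for current releases.
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:latest
```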
The TensorRT-LLM container comes with all dependencies pre-installed. Start the container with GPU access; the -p 8000:8000 flag exposes port 8000 for the serving API.
Option 1: Online Serving with trtllm-serve
The fastest way to deploy a model is using trtllm-serve, which provides an OpenAI-compatible API.
Start the server
Launch a model server using trtllm-serve. This example uses TinyLlama for quick testing. The server will:
- Download the model from Hugging Face (on first run)
- Optimize the model for your GPU
- Start an OpenAI-compatible HTTP server on port 8000
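A minimal launch command along these lines (the --host and --port flags are standard trtllm-serve options; defaults may differ by release):

```shell
# Serve TinyLlama with an OpenAI-compatible API on port 8000
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --host 0.0.0.0 --port 8000
```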
The first run may take a few minutes to download and optimize the model. Subsequent runs are much faster as the optimized model is cached.
Send a test request
Open a new terminal and attach to the running container. Alternatively, if you exposed port 8000, you can send a curl request from outside the container.
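A test request against the OpenAI-compatible chat endpoint might look like this (the prompt and token limit are illustrative); the server responds with a JSON body whose generated text is in choices[0].message.content:

```shell
# Send a chat completion request to the local trtllm-serve instance
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 32
      }'
```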
Try a quantized model (optional)
For better performance, use a pre-quantized FP8 model.
Browse more pre-optimized models in the NVIDIA Model Optimizer collection.
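For instance (the model ID below is an assumption based on NVIDIA's pre-quantized checkpoints on Hugging Face; substitute any FP8 checkpoint you have access to, and note that FP8 requires Hopper-class or newer GPUs):

```shell
# Serve a pre-quantized FP8 checkpoint (model ID is illustrative)
trtllm-serve "nvidia/Llama-3.1-8B-Instruct-FP8" --host 0.0.0.0 --port 8000
```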
Streaming Responses
To enable streaming (useful for chat applications), add "stream": true to your request.
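For example (same endpoint as above, prompt illustrative); with streaming enabled the server returns server-sent events, one "data:" chunk per token batch, instead of a single JSON body:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "max_tokens": 64,
        "stream": true
      }'
```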
Option 2: Offline Inference with LLM API
For batch processing or integration into Python applications, use the LLM API directly.
Create a Python script
Create a file called quickstart.py with the following code. This code:
- Loads TinyLlama from Hugging Face
- Defines three sample prompts
- Generates completions with temperature sampling
- Prints the results
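A sketch of quickstart.py along those lines, using the tensorrt_llm LLM API (the prompts and sampling values are illustrative):

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Three sample prompts to complete
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Temperature sampling with nucleus (top-p) filtering; values are illustrative
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Downloads TinyLlama from Hugging Face on first run, then optimizes it for your GPU
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Generate completions for all prompts in one batch and print them
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}")
        print(f"Generated: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```

Run it with `python quickstart.py` from inside the container.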
Loading Quantized Models
Load pre-quantized models directly from Hugging Face for optimal performance. Quantized models require significantly less GPU memory and can achieve 2-4x higher throughput with minimal accuracy loss.
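A hedged sketch (the FP8 model ID is an assumption; quantization settings are typically read from the checkpoint's own config, so no extra arguments should be needed):

```python
from tensorrt_llm import LLM

# Model ID is illustrative -- use any pre-quantized checkpoint you have access to.
# The quantization format is picked up from the checkpoint's configuration.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
```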
Customizing Generation
The SamplingParams class provides fine-grained control over text generation.
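For example (the values below are illustrative, and every field is optional):

```python
from tensorrt_llm import SamplingParams

params = SamplingParams(
    max_tokens=64,      # cap on the number of generated tokens
    temperature=0.8,    # >1.0 = more random, <1.0 = more deterministic
    top_p=0.95,         # nucleus sampling: keep the smallest set of tokens with cumulative prob >= 0.95
    top_k=40,           # restrict sampling to the 40 most likely tokens
)
```

Pass the object as the second argument to `llm.generate(...)`, as in the quickstart script.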
Next Steps
Congratulations! You've successfully run your first TensorRT-LLM inference. Here's what to explore next:
Installation Options
Learn about pip installation, building from source, and system requirements
Advanced Examples
Explore speculative decoding, multi-GPU inference, LoRA adapters, and more
Model Support
Check which models are supported and how to add custom models
Performance Tuning
Learn how to benchmark and optimize performance for your workload
Common Issues
Out of memory errors
If you encounter CUDA out of memory errors:
- Use a smaller model (e.g., TinyLlama instead of Llama-70B)
- Reduce batch size with the max_batch_size parameter
- Use quantized models (FP8, INT4) to reduce memory footprint
- Enable KV cache offloading for long sequences
Model download fails
If model download from Hugging Face fails:
- Check your internet connection
- Set a Hugging Face token if the model requires authentication
- Download the model manually and pass the local path
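The last two steps might look like this (the local directory name is illustrative; huggingface-cli ships with the huggingface_hub package):

```shell
# Authenticate for gated models (create a token at huggingface.co/settings/tokens)
export HF_TOKEN=<your_token>

# Or fetch the model manually, then point trtllm-serve (or LLM(model=...)) at the local path
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./tinyllama
trtllm-serve ./tinyllama --host 0.0.0.0 --port 8000
```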
Server connection refused
If you can’t connect to the trtllm-serve server:
- Ensure port 8000 is exposed with -p 8000:8000 in docker run
- Check the server started successfully (no errors in logs)
- Try connecting from inside the container first
- Check firewall settings if connecting from another machine
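An in-container connectivity check might look like this (replace the container ID placeholder with yours from `docker ps`; /v1/models is the standard OpenAI-compatible model-listing endpoint):

```shell
# From inside the container: list served models to confirm the server is up
docker exec -it <container_id> curl -s http://localhost:8000/v1/models
```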