Architecture Overview
Multimodal LLMs typically handle non-text inputs by combining a multimodal encoder with an LLM decoder:

Multimodal Input Processor
Preprocesses raw multimodal input (images, audio) into a format suitable for the encoder, such as pixel values or spectrograms.
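To make the preprocessing step concrete, the sketch below is a toy stand-in (not TensorRT-LLM's actual processor): it resizes a grayscale image with nearest-neighbor sampling and normalizes pixel values, the same shape of work a real processor does before handing pixel values to the encoder.

```python
def preprocess_image(pixels, size=(4, 4), mean=0.5, std=0.5):
    """Toy multimodal input processor: resize a grayscale image
    (list of rows) with nearest-neighbor sampling, then normalize
    values from [0, 255] to roughly [-1, 1]."""
    src_h, src_w = len(pixels), len(pixels[0])
    out_h, out_w = size
    resized = [
        [pixels[i * src_h // out_h][j * src_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]
    # Scale to [0, 1], then apply (x - mean) / std normalization.
    return [[((v / 255.0) - mean) / std for v in row] for row in resized]

image = [[0, 255], [255, 0]]          # 2x2 "image"
tensor = preprocess_image(image, size=(4, 4))
print(len(tensor), len(tensor[0]))    # 4 4
print(tensor[0][0], tensor[0][2])     # -1.0 1.0
```

Real processors use model-specific resolutions, interpolation, and per-channel statistics; the structure, however, is the same resize-then-normalize pipeline.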
Multimodal Encoder
Encodes the processed input into embeddings aligned with the LLM’s embedding space (e.g., vision transformers for images).
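A common way to align encoder outputs with the LLM's embedding space is a linear projector applied to each patch embedding. The sketch below is illustrative only (plain Python, tiny dimensions), not the actual projector of any supported model:

```python
def project(patch_embeddings, weight):
    """Toy linear projector: map encoder patch embeddings (dim d_enc)
    into the LLM's embedding space (dim d_llm) via a weight matrix."""
    return [
        [sum(e * w for e, w in zip(emb, col)) for col in zip(*weight)]
        for emb in patch_embeddings
    ]

# Two patch embeddings of dim 2, projected to dim 3.
patches = [[1.0, 0.0], [0.0, 2.0]]
W = [[1.0, 2.0, 3.0],   # row i maps encoder dim i to each LLM dim
     [4.0, 5.0, 6.0]]
tokens = project(patches, W)
print(tokens)  # [[1.0, 2.0, 3.0], [8.0, 10.0, 12.0]]
```

The projected embeddings are then interleaved with text token embeddings in the LLM's input sequence.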
Supported Models
TensorRT-LLM supports a wide range of multimodal architectures:

Vision-Language Models
- LLaVA (LLaMA + Vision)
- VILA (Visual Language Assistant)
- Qwen2-VL (Qwen with Vision)
- NVILA (NVIDIA Vision-Language)
- BLIP2 (Bootstrapped Language-Image Pre-training)
- Nougat (Neural Optical Understanding for Academic Documents)
Audio Models
- Whisper (Speech recognition)
- Audio-language models (coming soon)
For the complete and up-to-date support matrix, see the Multimodal Feature Support Matrix.
Optimizations
TensorRT-LLM incorporates key optimizations to enhance multimodal inference performance:

In-Flight Batching
Batches multimodal requests within the GPU executor to improve GPU utilization and throughput. Context-phase (image encoding) and generation-phase requests are batched together.
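The idea can be sketched with a toy scheduler (illustrative only, not TensorRT-LLM's executor): each step pulls whatever requests are in flight, mixing context-phase and generation-phase work in a single batch rather than waiting for phase-uniform batches.

```python
from collections import deque

def schedule_step(queue, max_batch=4):
    """Toy in-flight batching step: pull up to max_batch requests,
    mixing context-phase (image encoding) and generation-phase work
    in the same batch."""
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())
    done = []
    for req in batch:
        if req["phase"] == "context":
            req["phase"] = "generation"   # encoded; now generating
            queue.append(req)             # re-enters the in-flight pool
        else:
            req["tokens_left"] -= 1
            if req["tokens_left"] > 0:
                queue.append(req)
            else:
                done.append(req["id"])
    return done

queue = deque([
    {"id": "a", "phase": "context", "tokens_left": 2},
    {"id": "b", "phase": "generation", "tokens_left": 1},
])
finished = []
while queue:
    finished += schedule_step(queue)
print(finished)  # ['b', 'a']
```

Because new context-phase requests join batches alongside ongoing generation, the GPU stays busy instead of idling between phases.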
CPU/GPU Concurrency
Asynchronously overlaps data preprocessing on the CPU with image encoding on the GPU, reducing end-to-end latency.
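The overlap pattern looks roughly like the following sketch (a pure-Python illustration using stand-in functions, not TensorRT-LLM internals): preprocessing tasks are submitted asynchronously so that later images are being prepared on CPU worker threads while earlier ones are encoded.

```python
from concurrent.futures import ThreadPoolExecutor

def cpu_preprocess(image_id):
    """Stand-in for CPU-side preprocessing (resize, normalize)."""
    return f"pixels({image_id})"

def gpu_encode(pixels):
    """Stand-in for GPU-side image encoding."""
    return f"embeddings[{pixels}]"

def pipeline(image_ids):
    """Overlap CPU preprocessing of image i+1 with encoding of image i
    by submitting all preprocessing work asynchronously up front."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(cpu_preprocess, i) for i in image_ids]
        for fut in futures:
            # While this image is encoded, later preprocess tasks
            # keep running on the worker threads.
            results.append(gpu_encode(fut.result()))
    return results

print(pipeline(["img0", "img1"]))
# ['embeddings[pixels(img0)]', 'embeddings[pixels(img1)]']
```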
Raw Data Hashing
Leverages image hashes and token chunk information to improve KV cache reuse and minimize collisions. Identical images across requests share cached encoder outputs.
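The core mechanism can be sketched in a few lines (illustrative only; TensorRT-LLM's actual cache also incorporates token chunk information): hash the raw image bytes and use the digest as the cache key, so identical images hit the same entry regardless of which request they arrive in.

```python
import hashlib

_encoder_cache = {}

def encode_image(raw_bytes):
    """Cache encoder outputs keyed by a hash of the raw image bytes,
    so identical images across requests reuse the same entry."""
    key = hashlib.sha256(raw_bytes).hexdigest()
    if key not in _encoder_cache:
        # Stand-in for real encoder work, done only on a cache miss.
        _encoder_cache[key] = f"encoded:{key[:8]}"
    return _encoder_cache[key]

a = encode_image(b"cat.jpg bytes")
b = encode_image(b"cat.jpg bytes")   # identical image: cache hit
c = encode_image(b"dog.jpg bytes")   # different image: new entry
print(a == b, a == c)  # True False
```

Using a cryptographic hash of the content (rather than, say, a filename) is what keeps collisions negligible while still deduplicating identical inputs.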
Quick Start
Basic Usage
Run a vision-language model with a single image:

Multiple Images
Process multiple images in a single prompt:

KV Cache Reuse with UUIDs
For better cache management across sessions, provide custom UUIDs.

Why use UUIDs? Custom UUIDs enable deterministic cache management. The same UUID + content combination always produces the same cache key, allowing you to:
- Track cache entries externally
- Implement per-user cache isolation
- Pre-warm cache with known images
- Manage cache lifecycle across sessions
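The UUID-plus-content scheme described above can be sketched as a key-derivation helper (pure Python, illustrative only; TensorRT-LLM's internal keying may differ):

```python
import hashlib

def cache_key(content, uuid=None):
    """Derive a deterministic cache key: a caller-supplied UUID scopes
    the entry (e.g. per user or session), while the content hash keeps
    two different images from sharing a key under the same UUID."""
    digest = hashlib.sha256(content).hexdigest()[:16]
    return f"{uuid}:{digest}" if uuid else digest

k1 = cache_key(b"image bytes", uuid="user-42")
k2 = cache_key(b"image bytes", uuid="user-42")   # same UUID + content
k3 = cache_key(b"image bytes", uuid="user-43")   # isolated per user
print(k1 == k2, k1 == k3)  # True False
```

Because the key is a pure function of UUID and content, an external system can pre-compute keys to pre-warm the cache or evict entries per user.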
Serving Multimodal Models
Start OpenAI-Compatible Server
Launch a server with multimodal support:

Send Requests with Images
- Python (OpenAI SDK)
- cURL
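As a sketch of the cURL path, a multimodal request to an OpenAI-compatible endpoint follows the standard chat-completions schema with an `image_url` content part. The port, path, model name, and image URL below are assumptions for illustration; adjust them to your deployment:

```shell
# Hypothetical endpoint/model; substitute your server's address and
# the model you launched it with.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
      ]
    }],
    "max_tokens": 64
  }'
```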
Benchmarking
Evaluate multimodal inference performance:

For detailed benchmarking instructions, see the performance benchmarking guide.
Configuration Options
Disable KV Cache Reuse
For testing or when cache reuse is not beneficial:

Multimodal-Specific Cache Settings
Model-Specific Examples
LLaVA
NVILA
Qwen2-VL
Best Practices
Image Preprocessing
- Resize images to the model’s expected resolution before inference
- Use appropriate image format (JPEG, PNG) based on content
- Normalize pixel values according to model requirements
- Batch multiple images when possible for better throughput
KV Cache Management
- Enable `enable_block_reuse=True` for scenarios with repeated images
- Use custom `multi_modal_uuids` for deterministic cache keys
- Allocate sufficient GPU memory for KV cache (90%+ of free memory)
- Consider FP8 KV cache for 2x memory savings
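In the LLM API these settings are typically expressed through a KV cache config object. The sketch below assumes the `KvCacheConfig` class from `tensorrt_llm.llmapi` and its `enable_block_reuse` / `free_gpu_memory_fraction` parameters; verify the names against your installed version:

```python
from tensorrt_llm.llmapi import KvCacheConfig

# Assumed parameter names -- check the LLM API reference for your version.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,        # reuse cached blocks for repeated images
    free_gpu_memory_fraction=0.9,   # give the KV cache ~90% of free memory
)
```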
Prompt Engineering
- Follow model-specific prompt templates (LLaVA uses `USER:`/`ASSISTANT:`, Qwen uses special tokens)
- Place image tokens where the model expects them
- Be explicit about what you want the model to analyze
- For multiple images, clearly reference which image you’re asking about
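As an illustration of the template point above, a LLaVA-style prompt might be assembled like this. The helper and the exact token layout are hypothetical; the correct template varies by model version, so follow the model card:

```python
def llava_prompt(question):
    """Hypothetical helper formatting a question in a LLaVA-style
    USER:/ASSISTANT: chat template with an image placeholder token."""
    return f"USER: <image>\n{question} ASSISTANT:"

print(llava_prompt("What is in this picture?"))
```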
Performance Optimization
- Use in-flight batching to mix image encoding and text generation
- Enable CPU/GPU concurrency for image preprocessing
- Monitor cache hit rates for repeated images
- Benchmark with representative workloads
Limitations
Complete Example
Here’s a full example with all best practices:

Additional Resources
Multimodal Examples
Complete quickstart example for multimodal models
Supported Models
Full multimodal model support matrix
Serving Script
Example serving client for multimodal requests
Benchmarking Guide
Measure multimodal inference performance