## Overview

This guide covers best practices, performance tuning, monitoring, and scaling strategies for production TensorRT-LLM deployments.

## Architecture Considerations
### Choosing a Backend

#### PyTorch Backend

The default choice:
- Best compatibility
- Active development
- Full feature support
- Easier debugging
#### TensorRT Backend

Maximum performance:
- Lowest latency
- Highest throughput
- Requires a build step
- Limited to specific models
#### AutoDeploy Backend

Experimental:
- Automatic optimization
- On-the-fly quantization
- Beta stability
### Deployment Patterns

- Single Server
- Load Balanced Fleet
- Disaggregated

A single server is best for:
- Low to medium traffic
- Development/testing
- Models under 70B parameters on a single GPU
## Performance Tuning

### Memory Management
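TensorRT-LLM carves its KV cache out of the GPU memory left over after model weights are loaded, controlled by `free_gpu_memory_fraction`. A back-of-envelope sketch of that budget (the model shape and memory figures below are illustrative assumptions, not measurements):

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # K and V each store num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_cached_tokens(gpu_mem_bytes, weights_bytes, free_gpu_memory_fraction,
                      per_token_bytes):
    # A fraction of the memory remaining after weights goes to the KV cache.
    budget = (gpu_mem_bytes - weights_bytes) * free_gpu_memory_fraction
    return int(budget // per_token_bytes)

# Llama-3-8B-like shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
per_tok = kv_cache_bytes_per_token(32, 8, 128)
print(per_tok)  # 131072 bytes per cached token

# 80 GB GPU, ~16 GB of weights, fraction 0.95 (as in the checklist below).
print(max_cached_tokens(80e9, 16e9, 0.95, per_tok))  # 463867 tokens
```

This is why lowering `free_gpu_memory_fraction` (or switching the cache to FP8) is the first lever when requests hit OOM: both directly change the token budget above.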
### Batching Configuration

`config.yaml`:
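A sketch of the batching-related keys, using the key names that appear in this guide's troubleshooting and checklist sections; the values follow the "Mixed workload" row of the tuning table below:

```yaml
# Hedged example; tune per the table below.
max_batch_size: 256
max_num_tokens: 16384
kv_cache_config:
  free_gpu_memory_fraction: 0.95
  enable_block_reuse: true
```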
#### Tuning Guidelines
| Workload | max_batch_size | max_num_tokens | Notes |
|---|---|---|---|
| Short prompts + outputs | 512 | 8192 | Maximize throughput |
| Long prompts | 128 | 32768 | Prevent OOM |
| Streaming responses | 256 | 16384 | Balance latency/throughput |
| Mixed workload | 256 | 16384 | Safe defaults |
### CUDA Graphs

Enable CUDA graphs for a 20-30% latency reduction on decode steps:
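A minimal config fragment, using the `use_cuda_graph` key as it appears in this guide's High Latency troubleshooting entry:

```yaml
use_cuda_graph: true
```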
### Overlap Scheduler

Enable compute/communication overlap so scheduler CPU work is hidden behind GPU execution (PyTorch backend only):
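Again a minimal fragment, with the key name taken from this guide's High Latency troubleshooting entry:

```yaml
enable_overlap_scheduler: true
```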
## Monitoring and Observability

### Metrics to Track
#### Request Metrics
- Time to First Token (TTFT): Prefill latency
- Time Per Output Token (TPOT): Decode latency
- Request throughput: Requests/second
- Token throughput: Tokens/second
- Queue time: Time waiting in scheduler
#### System Metrics
- GPU utilization: Target >80%
- GPU memory usage: Monitor for OOM
- KV cache hit rate: Higher = better efficiency
- Active requests: Current concurrency
- Batch sizes: Average batch utilization
#### Error Metrics
- Request failures: HTTP 5xx errors
- OOM errors: KV cache exhaustion
- Timeout errors: Requests exceeding max wait time
### Collecting Metrics

#### OpenTelemetry Integration
Export traces to observability platforms. A running OpenTelemetry Collector is required; see the OpenTelemetry docs for setup.
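For reference, a minimal OpenTelemetry Collector pipeline that receives OTLP traces and logs them (this is the Collector's own config format, not TensorRT-LLM's; the `debug` exporter name depends on your Collector version):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```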
## Scaling Strategies
### Vertical Scaling (Single Node)
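The main single-node lever is tensor parallelism: shard one model across several GPUs to fit larger weights and cut per-token latency. A hypothetical launch across 4 GPUs (the model name is a placeholder and the flag name is assumed from `trtllm-serve` conventions; check your installed version's `--help`):

```bash
# Shard one model across 4 GPUs on one node via tensor parallelism.
trtllm-serve meta-llama/Llama-3.1-70B --tp_size 4
```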
### Horizontal Scaling (Multi-Instance)
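Horizontal scaling runs identical server replicas behind a load balancer. In production the balancing is done by nginx, a Kubernetes Service, or similar; this toy sketch just illustrates the round-robin idea (hostnames are placeholders):

```python
from itertools import cycle

# Identical server replicas behind a client-side round-robin.
replicas = [
    "http://gpu-node-0:8000",
    "http://gpu-node-1:8000",
    "http://gpu-node-2:8000",
]
backends = cycle(replicas)

# Each request goes to the next replica in turn, wrapping around.
targets = [next(backends) for _ in range(4)]
print(targets)
```

Because each replica holds its own KV cache, sticky routing per conversation improves cache reuse; plain round-robin maximizes evenness instead.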
### Multi-Node Scaling
For models larger than 70B parameters, use a multi-node deployment:
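A rough sketch for a 2-node, 8-GPU-per-node Slurm cluster, combining tensor parallelism within nodes and pipeline parallelism across them (model name, launcher, and flag names are all assumptions; adapt to your cluster and installed version):

```bash
# 2 nodes x 8 GPUs: tp_size 8 within a node, pp_size 2 across nodes.
srun -N 2 --ntasks-per-node=8 \
  trtllm-serve meta-llama/Llama-3.1-405B --tp_size 8 --pp_size 2
```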
## Security Best Practices

## High Availability
### Health Checks
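Load balancers and orchestrators should gate traffic on the server's health endpoint. Assuming the default port of 8000 and a `/health` route (verify against your server version's API):

```bash
# -f makes curl exit non-zero on HTTP errors, so this works as a probe.
curl -f http://localhost:8000/health
```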
### Graceful Shutdown
### Kubernetes Deployment

`trtllm-deployment.yaml`:
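A minimal Deployment sketch tying together the patterns above: GPU resource limits, a readiness probe on the health endpoint, and a generous termination grace period for draining in-flight requests. The image tag, model name, and probe path are placeholders to adapt:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtllm-serve
spec:
  replicas: 2
  selector:
    matchLabels:
      app: trtllm-serve
  template:
    metadata:
      labels:
        app: trtllm-serve
    spec:
      terminationGracePeriodSeconds: 120  # allow in-flight requests to drain
      containers:
        - name: trtllm
          image: nvcr.io/nvidia/tensorrt-llm/release:latest  # placeholder tag
          args: ["trtllm-serve", "meta-llama/Llama-3.1-8B-Instruct"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1  # one GPU per replica
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
```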
## Troubleshooting
### Out of Memory (OOM) Errors

Symptoms: Requests fail with CUDA OOM errors.

Solutions:
- Reduce `max_batch_size` or `max_num_tokens`
- Lower `free_gpu_memory_fraction` to 0.85
- Enable FP8 KV cache: `kv_cache_config.dtype: fp8`
- Disable CUDA graphs if enabled
- Use tensor parallelism for larger models
### High Latency

Symptoms: TTFT or TPOT higher than expected.

Solutions:
- Enable CUDA graphs: `use_cuda_graph: true`
- Enable the overlap scheduler (PyTorch): `enable_overlap_scheduler: true`
- Increase `max_num_tokens` to allow larger batches
- Check GPU utilization (should be >80%)
- Reduce `tokens_per_block` to 16 for short requests
### Low Throughput

Symptoms: Requests/second below expectations.

Solutions:
- Increase `max_batch_size`
- Enable KV cache reuse: `enable_block_reuse: true`
- Use async generation: `generate_async()`
- Check queue time in metrics (should be under 100 ms)
- Scale horizontally with a load balancer
### Model Not Loading

Symptoms: Server fails to start, or model download issues.

Solutions:
- Check your HuggingFace token: `huggingface-cli login`
- Pre-download the model: `huggingface-cli download <model>`
- Verify disk space (models can be 50 GB+)
- Check model compatibility with the chosen backend
- Enable `--trust_remote_code` if using custom model code
## Performance Checklist

Configure the KV cache:

- ✅ `free_gpu_memory_fraction: 0.95`
- ✅ `enable_block_reuse: true`
- ✅ `dtype: fp8` (on Hopper GPUs)

## Next Steps
- **Distributed Inference**: Multi-GPU and multi-node deployments
- **Benchmarking**: Measure and optimize performance
- **Reference Configs**: 170+ optimized configurations
- **API Reference**: Complete configuration reference