
Overview

This guide covers best practices, performance tuning, monitoring, and scaling strategies for production TensorRT-LLM deployments.

Architecture Considerations

Choosing a Backend

PyTorch Backend

Default choice
  • Best compatibility
  • Active development
  • Full feature support
  • Easier debugging

TensorRT Backend

Maximum performance
  • Lowest latency
  • Highest throughput
  • Requires build step
  • Limited to specific models

AutoDeploy Backend

Experimental
  • Automatic optimization
  • On-the-fly quantization
  • Beta stability

Start with the PyTorch backend for development and production. Switch to the TensorRT backend only if you need absolute maximum performance and your model is supported.

Deployment Patterns

Single Instance

Best for:
  • Low to medium traffic
  • Development/testing
  • Models under 70B parameters on single GPU
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --max_batch_size 128 \
  --max_num_tokens 8192

Performance Tuning

Memory Management

1. Optimize KV cache allocation

config.yaml
kv_cache_config:
  free_gpu_memory_fraction: 0.95  # Use 95% of free memory
  enable_block_reuse: true        # Reuse cache blocks
  tokens_per_block: 32            # Larger blocks = less overhead
Setting free_gpu_memory_fraction too high may cause OOM errors during traffic spikes.
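To build intuition for free_gpu_memory_fraction, it helps to estimate how many tokens the cache can actually hold. A self-contained sketch follows; the layer/head dimensions are illustrative placeholders (roughly Llama-3.1-8B-shaped), not values read from any real model:

```python
def kv_cache_token_capacity(
    free_gpu_bytes: int,
    free_gpu_memory_fraction: float = 0.95,
    num_layers: int = 32,      # illustrative model dimensions
    num_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_elem: int = 2,   # fp16; fp8 would be 1
) -> int:
    """Rough upper bound on KV-cache token capacity.

    Per token, the cache stores one key and one value vector per layer:
    2 * num_layers * num_kv_heads * head_dim * bytes_per_elem bytes.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    budget = int(free_gpu_bytes * free_gpu_memory_fraction)
    return budget // bytes_per_token


# With 40 GB free and fp16 KV cache on the example dims:
tokens_fp16 = kv_cache_token_capacity(40 * 1024**3)
# fp8 (bytes_per_elem=1) doubles the capacity, matching the 2x claim below:
tokens_fp8 = kv_cache_token_capacity(40 * 1024**3, bytes_per_elem=1)
print(tokens_fp16, tokens_fp8)
```

The same arithmetic, run in reverse, tells you how much headroom a given max_num_tokens setting leaves.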
2. Enable FP8 KV cache

kv_cache_config:
  dtype: fp8  # 2x memory savings vs FP16
FP8 KV cache requires Hopper GPUs (H100, H200) for best performance.
3. Monitor cache efficiency

curl http://localhost:8000/metrics | jq '.[] | .kvCacheStats'
Target metrics:
  • Cache hit rate: >50% for similar prompts
  • Free blocks: >10% of max blocks
  • Tokens per block: 32 (default)
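The targets above can be checked programmatically against the /metrics payload. The sketch below follows the kvCacheStats shape used elsewhere in this guide, but treat the field names (freeNumBlocks, maxNumBlocks) as illustrative assumptions:

```python
def check_cache_health(kv_stats: dict) -> list[str]:
    """Return warnings for KV-cache stats that miss the targets above.

    Assumes stats of the form {"cacheHitRate": float, "freeNumBlocks": int,
    "maxNumBlocks": int} -- field names are illustrative, not guaranteed.
    """
    warnings = []
    if kv_stats["cacheHitRate"] < 0.50:
        warnings.append(
            f"cache hit rate {kv_stats['cacheHitRate']:.0%} below 50% target"
        )
    free_fraction = kv_stats["freeNumBlocks"] / kv_stats["maxNumBlocks"]
    if free_fraction < 0.10:
        warnings.append(f"free blocks {free_fraction:.0%} below 10% target")
    return warnings


print(check_cache_health(
    {"cacheHitRate": 0.35, "freeNumBlocks": 40, "maxNumBlocks": 1000}
))
```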

Batching Configuration

config.yaml
max_batch_size: 256        # Max concurrent requests
max_num_tokens: 16384      # Max tokens across batch

scheduler_config:
  capacity_scheduler_policy: GUARANTEED_NO_EVICT
  dynamic_batch_config:
    enable_batch_size_tuning: true
    enable_max_num_tokens_tuning: false
    dynamic_batch_moving_average_window: 128
| Workload                | max_batch_size | max_num_tokens | Notes                      |
|-------------------------|----------------|----------------|----------------------------|
| Short prompts + outputs | 512            | 8192           | Maximize throughput        |
| Long prompts            | 128            | 32768          | Prevent OOM                |
| Streaming responses     | 256            | 16384          | Balance latency/throughput |
| Mixed workload          | 256            | 16384          | Safe defaults              |
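The starting points in the table can be encoded as presets so that deployment scripts pick them by workload name. A minimal sketch:

```python
# Batching presets from the table above; "mixed" is the safe default.
BATCHING_PRESETS = {
    "short":     {"max_batch_size": 512, "max_num_tokens": 8192},
    "long":      {"max_batch_size": 128, "max_num_tokens": 32768},
    "streaming": {"max_batch_size": 256, "max_num_tokens": 16384},
    "mixed":     {"max_batch_size": 256, "max_num_tokens": 16384},
}


def batching_config(workload: str) -> dict:
    """Look up a starting point; unknown workloads get the safe defaults."""
    return BATCHING_PRESETS.get(workload, BATCHING_PRESETS["mixed"])


print(batching_config("long"))
```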

CUDA Graphs

Enable CUDA graphs for 20-30% latency reduction on decode steps:
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]  # Pre-capture these sizes
CUDA graphs increase memory usage (~500MB per batch size). Reduce batch_sizes if memory-constrained.
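Given the ~500 MB-per-size overhead noted above, you can trim the batch_sizes list to fit a memory budget before launch. The trimming policy below (keep the smallest sizes first) is an assumption for illustration, not TensorRT-LLM behavior:

```python
GRAPH_OVERHEAD_MB = 500  # approximate cost per captured batch size (see above)


def trim_batch_sizes(batch_sizes: list[int], budget_mb: int) -> list[int]:
    """Keep as many capture sizes as fit the memory budget, smallest first.

    Small batch sizes are kept preferentially: decode steps at low
    concurrency are where graph capture helps latency most.
    """
    keep = len(batch_sizes)
    while keep * GRAPH_OVERHEAD_MB > budget_mb and keep > 0:
        keep -= 1
    return sorted(batch_sizes)[:keep]


# 8 sizes would need ~4000 MB; a 2 GB budget keeps only the 4 smallest:
print(trim_batch_sizes([1, 2, 4, 8, 16, 32, 64, 128], budget_mb=2048))
```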

Overlap Scheduler

Enable compute/communication overlap (PyTorch backend only):
pytorch_backend_config:
  enable_overlap_scheduler: true
Overlap scheduler can improve throughput by 10-15% for multi-GPU deployments.

Monitoring and Observability

Metrics to Track

  • Time to First Token (TTFT): Prefill latency
  • Time Per Output Token (TPOT): Decode latency
  • Request throughput: Requests/second
  • Token throughput: Tokens/second
  • Queue time: Time waiting in scheduler
  • GPU utilization: Target >80%
  • GPU memory usage: Monitor for OOM
  • KV cache hit rate: Higher = better efficiency
  • Active requests: Current concurrency
  • Batch sizes: Average batch utilization
  • Request failures: HTTP 5xx errors
  • OOM errors: KV cache exhaustion
  • Timeout errors: Requests exceeding max wait time

Collecting Metrics

import time

import requests


def monitor_metrics(url="http://localhost:8000/metrics", interval=5):
    """Poll the metrics endpoint and print key per-iteration stats."""
    while True:
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"Metrics request failed: {exc}")
            time.sleep(interval)
            continue

        for stat in response.json():
            print(f"Iter {stat['iter']}:")
            print(f"  GPU Memory: {stat['gpuMemUsage'] / 1e9:.2f} GB")
            print(f"  KV Cache Hit Rate: {stat['kvCacheStats']['cacheHitRate']:.2%}")
            print(f"  Active Requests: {stat['numActiveRequests']}")
            print(f"  Iteration Latency: {stat['iterLatencyMS']:.2f}ms")

        time.sleep(interval)


if __name__ == "__main__":
    monitor_metrics()

OpenTelemetry Integration

Export traces to observability platforms:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --otlp_traces_endpoint http://localhost:4318/v1/traces
Requires OpenTelemetry Collector running. See OpenTelemetry docs for setup.

Scaling Strategies

Vertical Scaling (Single Node)

1. Use tensor parallelism for large models

# Split Llama-70B across 4 GPUs
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tp_size 4
2. Increase batch size and token limits

max_batch_size: 512
max_num_tokens: 32768
kv_cache_config:
  free_gpu_memory_fraction: 0.98
3. Enable all optimizations

cuda_graph_config:
  enable_padding: true
pytorch_backend_config:
  enable_overlap_scheduler: true
kv_cache_config:
  enable_block_reuse: true
  dtype: fp8

Horizontal Scaling (Multi-Instance)

upstream trtllm_backend {
    least_conn;  # Route to instance with fewest connections
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
    server localhost:8004;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://trtllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        
        # Timeouts for long generations
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
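The least_conn policy above suits LLM serving because long generations keep connections open for very different durations, making connection counts a better balance signal than round-robin. The selection rule itself is tiny; this is a toy illustration, not a replacement for nginx:

```python
def least_conn(servers: dict[str, int]) -> str:
    """Pick the upstream with the fewest active connections,
    mirroring nginx's least_conn policy."""
    return min(servers, key=servers.get)


# Hypothetical snapshot of active connections per instance:
active = {
    "localhost:8001": 12,
    "localhost:8002": 3,
    "localhost:8003": 7,
    "localhost:8004": 9,
}
print(least_conn(active))  # -> localhost:8002
```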

Multi-Node Scaling

For models >70B parameters, use multi-node deployment:
# config.yml
cat > config.yml <<EOF
enable_attention_dp: true
pytorch_backend_config:
  enable_overlap_scheduler: true
EOF

# Slurm deployment (2 nodes, 16 GPUs total)
srun -N 2 \
  --ntasks 16 --ntasks-per-node=8 \
  --mpi=pmix --gres=gpu:8 \
  --container-image=nvcr.io/nvidia/tensorrt-llm:latest \
  bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 \
    --tp_size 16 \
    --ep_size 4 \
    --max_batch_size 161 \
    --config ./config.yml"

Security Best Practices

1. Disable trust_remote_code in production

trtllm-serve <model>  # trust_remote_code defaults to False
Only enable --trust_remote_code for models from trusted sources.
2. Use authentication/authorization

Deploy behind API gateway with auth:
location /v1 {
    auth_request /auth;
    proxy_pass http://trtllm_backend;
}

location = /auth {
    internal;
    proxy_pass http://auth-service/verify;
}
3. Rate limiting

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

location /v1 {
    limit_req zone=api_limit burst=20;
    proxy_pass http://trtllm_backend;
}
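Clients that hit this limiter receive HTTP 429, so it pairs well with client-side exponential backoff to avoid synchronized retry storms. A minimal sketch of the delay schedule (all parameters are illustrative):

```python
import random


def backoff_delays(max_retries: int = 5, base: float = 0.5, cap: float = 8.0):
    """Exponential backoff with jitter for HTTP 429 responses.

    Yields the sleep duration before each retry attempt; jitter spreads
    retries out so throttled clients don't all come back at once.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.0)


for d in backoff_delays():
    print(f"retry after {d:.2f}s")
```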
4. Input validation

Set reasonable limits:
max_seq_len: 8192      # Prevent excessive memory use
max_tokens: 2048       # Limit output length

High Availability

Health Checks

import time

import requests


def health_check(url: str, timeout: int = 5) -> bool:
    try:
        response = requests.get(f"{url}/health", timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False


# Watchdog loop; in Kubernetes, liveness probes handle this instead
while True:
    if not health_check("http://localhost:8000"):
        print("Service unhealthy, restarting...")
        # Trigger restart
    time.sleep(30)

Graceful Shutdown

import signal
import sys
from tensorrt_llm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

def shutdown_handler(signum, frame):
    print("Shutting down gracefully...")
    llm.shutdown()  # Finish in-flight requests
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown_handler)
signal.signal(signal.SIGINT, shutdown_handler)

# Start serving...

Kubernetes Deployment

trtllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtllm-serve
spec:
  replicas: 3
  selector:
    matchLabels:
      app: trtllm
  template:
    metadata:
      labels:
        app: trtllm
    spec:
      containers:
      - name: trtllm
        image: nvcr.io/nvidia/tensorrt-llm:latest
        command: ["trtllm-serve"]
        args:
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port=8000"
          - "--config=/config/config.yaml"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 16Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: trtllm-config
---
apiVersion: v1
kind: Service
metadata:
  name: trtllm-service
spec:
  selector:
    app: trtllm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Troubleshooting

Out-of-Memory Errors

Symptoms: Requests fail with CUDA OOM errors

Solutions:
  1. Reduce max_batch_size or max_num_tokens
  2. Lower free_gpu_memory_fraction to 0.85
  3. Enable FP8 KV cache: kv_cache_config.dtype: fp8
  4. Disable CUDA graphs if enabled
  5. Use tensor parallelism for larger models
High Latency

Symptoms: TTFT or TPOT higher than expected

Solutions:
  1. Enable CUDA graphs: use_cuda_graph: true
  2. Enable overlap scheduler (PyTorch): enable_overlap_scheduler: true
  3. Increase max_num_tokens to allow larger batches
  4. Check GPU utilization (should be >80%)
  5. Reduce tokens_per_block to 16 for short requests
Low Throughput

Symptoms: Requests/second below expectations

Solutions:
  1. Increase max_batch_size
  2. Enable KV cache reuse: enable_block_reuse: true
  3. Use async generation: generate_async()
  4. Check queue time in metrics (should be less than 100ms)
  5. Scale horizontally with load balancer
Startup Failures

Symptoms: Server fails to start, or the model download fails

Solutions:
  1. Check HuggingFace token: huggingface-cli login
  2. Pre-download model: huggingface-cli download <model>
  3. Verify disk space (models can be 50GB+)
  4. Check model compatibility with backend
  5. Enable --trust_remote_code if using custom model code

Performance Checklist

1. Choose optimal backend

✅ PyTorch for compatibility, TensorRT for max performance
2. Configure KV cache

✅ free_gpu_memory_fraction: 0.95 ✅ enable_block_reuse: true ✅ dtype: fp8 (on Hopper GPUs)
3. Set batch limits

✅ max_batch_size: 256 (adjust for GPU memory) ✅ max_num_tokens: 16384
4. Enable optimizations

✅ CUDA graphs ✅ Overlap scheduler (PyTorch) ✅ Dynamic batching
5. Monitor metrics

✅ GPU utilization >80% ✅ KV cache hit rate >50% ✅ Queue time less than 100ms
6. Scale appropriately

✅ Tensor parallelism for large models ✅ Horizontal scaling for high traffic ✅ Load balancer for multi-instance

Next Steps

Distributed Inference

Multi-GPU and multi-node deployments

Benchmarking

Measure and optimize performance

Reference Configs

170+ optimized configurations

API Reference

Complete configuration reference
