
Overview

This guide covers best practices, performance tuning, monitoring, and scaling strategies for production TensorRT-LLM deployments.

Architecture Considerations

Choosing a Backend

PyTorch Backend

Default choice
  • Best compatibility
  • Active development
  • Full feature support
  • Easier debugging

TensorRT Backend

Maximum performance
  • Lowest latency
  • Highest throughput
  • Requires build step
  • Limited to specific models

AutoDeploy Backend

Experimental
  • Automatic optimization
  • On-the-fly quantization
  • Beta stability

Start with the PyTorch backend for development and production. Switch to the TensorRT backend only if you need absolute maximum performance and your model is supported.

Deployment Patterns

Single Instance

Best for:
  • Low to medium traffic
  • Development/testing
  • Models under 70B parameters on single GPU
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --max_batch_size 128 \
  --max_num_tokens 8192

Performance Tuning

Memory Management

1. Optimize KV cache allocation

config.yaml
kv_cache_config:
  free_gpu_memory_fraction: 0.95  # Use 95% of free memory
  enable_block_reuse: true        # Reuse cache blocks
  tokens_per_block: 32            # Larger blocks = less overhead
Setting free_gpu_memory_fraction too high may cause OOM errors during traffic spikes.
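To build intuition for free_gpu_memory_fraction, it helps to estimate how many tokens the cache can actually hold. A self-contained sketch follows; the layer/head dimensions are illustrative placeholders (roughly Llama-3.1-8B-shaped), not values read from any real model:

```python
def kv_cache_token_capacity(
    free_gpu_bytes: int,
    free_gpu_memory_fraction: float = 0.95,
    num_layers: int = 32,      # illustrative model dimensions
    num_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_elem: int = 2,   # fp16; fp8 would be 1
) -> int:
    """Rough upper bound on KV-cache token capacity.

    Per token, the cache stores one key and one value vector per layer:
    2 * num_layers * num_kv_heads * head_dim * bytes_per_elem bytes.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    budget = int(free_gpu_bytes * free_gpu_memory_fraction)
    return budget // bytes_per_token


# With 40 GB free and fp16 KV cache on the example dims:
tokens_fp16 = kv_cache_token_capacity(40 * 1024**3)
# fp8 (bytes_per_elem=1) doubles the capacity, matching the 2x claim below:
tokens_fp8 = kv_cache_token_capacity(40 * 1024**3, bytes_per_elem=1)
print(tokens_fp16, tokens_fp8)
```

The same arithmetic, run in reverse, tells you how much headroom a given max_num_tokens setting leaves.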
2. Enable FP8 KV cache

kv_cache_config:
  dtype: fp8  # 2x memory savings vs FP16
FP8 KV cache requires Hopper GPUs (H100, H200) for best performance.
3. Monitor cache efficiency

curl http://localhost:8000/metrics | jq '.[] | .kvCacheStats'
Target metrics:
  • Cache hit rate: >50% for similar prompts
  • Free blocks: >10% of max blocks
  • Tokens per block: 32 (default)
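The targets above can be checked programmatically against the /metrics payload. The sketch below follows the kvCacheStats shape used elsewhere in this guide, but treat the field names (freeNumBlocks, maxNumBlocks) as illustrative assumptions:

```python
def check_cache_health(kv_stats: dict) -> list[str]:
    """Return warnings for KV-cache stats that miss the targets above.

    Assumes stats of the form {"cacheHitRate": float, "freeNumBlocks": int,
    "maxNumBlocks": int} -- field names are illustrative, not guaranteed.
    """
    warnings = []
    if kv_stats["cacheHitRate"] < 0.50:
        warnings.append(
            f"cache hit rate {kv_stats['cacheHitRate']:.0%} below 50% target"
        )
    free_fraction = kv_stats["freeNumBlocks"] / kv_stats["maxNumBlocks"]
    if free_fraction < 0.10:
        warnings.append(f"free blocks {free_fraction:.0%} below 10% target")
    return warnings


print(check_cache_health(
    {"cacheHitRate": 0.35, "freeNumBlocks": 40, "maxNumBlocks": 1000}
))
```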

Batching Configuration

config.yaml
max_batch_size: 256        # Max concurrent requests
max_num_tokens: 16384      # Max tokens across batch

scheduler_config:
  capacity_scheduler_policy: GUARANTEED_NO_EVICT
  dynamic_batch_config:
    enable_batch_size_tuning: true
    enable_max_num_tokens_tuning: false
    dynamic_batch_moving_average_window: 128
| Workload                | max_batch_size | max_num_tokens | Notes                      |
|-------------------------|----------------|----------------|----------------------------|
| Short prompts + outputs | 512            | 8192           | Maximize throughput        |
| Long prompts            | 128            | 32768          | Prevent OOM                |
| Streaming responses     | 256            | 16384          | Balance latency/throughput |
| Mixed workload          | 256            | 16384          | Safe defaults              |
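The starting points in the table can be encoded as presets so that deployment scripts pick them by workload name. A minimal sketch:

```python
# Batching presets from the table above; "mixed" is the safe default.
BATCHING_PRESETS = {
    "short":     {"max_batch_size": 512, "max_num_tokens": 8192},
    "long":      {"max_batch_size": 128, "max_num_tokens": 32768},
    "streaming": {"max_batch_size": 256, "max_num_tokens": 16384},
    "mixed":     {"max_batch_size": 256, "max_num_tokens": 16384},
}


def batching_config(workload: str) -> dict:
    """Look up a starting point; unknown workloads get the safe defaults."""
    return BATCHING_PRESETS.get(workload, BATCHING_PRESETS["mixed"])


print(batching_config("long"))
```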

CUDA Graphs

Enable CUDA graphs for 20-30% latency reduction on decode steps:
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]  # Pre-capture these sizes
CUDA graphs increase memory usage (~500MB per batch size). Reduce batch_sizes if memory-constrained.
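Given the ~500 MB-per-size overhead noted above, you can trim the batch_sizes list to fit a memory budget before launch. The trimming policy below (keep the smallest sizes first) is an assumption for illustration, not TensorRT-LLM behavior:

```python
GRAPH_OVERHEAD_MB = 500  # approximate cost per captured batch size (see above)


def trim_batch_sizes(batch_sizes: list[int], budget_mb: int) -> list[int]:
    """Keep as many capture sizes as fit the memory budget, smallest first.

    Small batch sizes are kept preferentially: decode steps at low
    concurrency are where graph capture helps latency most.
    """
    keep = len(batch_sizes)
    while keep * GRAPH_OVERHEAD_MB > budget_mb and keep > 0:
        keep -= 1
    return sorted(batch_sizes)[:keep]


# 8 sizes would need ~4000 MB; a 2 GB budget keeps only the 4 smallest:
print(trim_batch_sizes([1, 2, 4, 8, 16, 32, 64, 128], budget_mb=2048))
```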

Overlap Scheduler

Enable compute/communication overlap (PyTorch backend only):
pytorch_backend_config:
  enable_overlap_scheduler: true
Overlap scheduler can improve throughput by 10-15% for multi-GPU deployments.

Monitoring and Observability

Metrics to Track

  • Time to First Token (TTFT): Prefill latency
  • Time Per Output Token (TPOT): Decode latency
  • Request throughput: Requests/second
  • Token throughput: Tokens/second
  • Queue time: Time waiting in scheduler
  • GPU utilization: Target >80%
  • GPU memory usage: Monitor for OOM
  • KV cache hit rate: Higher = better efficiency
  • Active requests: Current concurrency
  • Batch sizes: Average batch utilization
  • Request failures: HTTP 5xx errors
  • OOM errors: KV cache exhaustion
  • Timeout errors: Requests exceeding max wait time

Collecting Metrics

import time

import requests


def monitor_metrics(url="http://localhost:8000/metrics", interval=5):
    """Poll the metrics endpoint and print key per-iteration stats."""
    while True:
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"Metrics request failed: {exc}")
            time.sleep(interval)
            continue

        for stat in response.json():
            print(f"Iter {stat['iter']}:")
            print(f"  GPU Memory: {stat['gpuMemUsage'] / 1e9:.2f} GB")
            print(f"  KV Cache Hit Rate: {stat['kvCacheStats']['cacheHitRate']:.2%}")
            print(f"  Active Requests: {stat['numActiveRequests']}")
            print(f"  Iteration Latency: {stat['iterLatencyMS']:.2f}ms")

        time.sleep(interval)


if __name__ == "__main__":
    monitor_metrics()

OpenTelemetry Integration

Export traces to observability platforms:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --otlp_traces_endpoint http://localhost:4318/v1/traces
Requires OpenTelemetry Collector running. See OpenTelemetry docs for setup.

Scaling Strategies

Vertical Scaling (Single Node)

1. Use tensor parallelism for large models

# Split Llama-70B across 4 GPUs
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tp_size 4
2. Increase batch size and token limits

max_batch_size: 512
max_num_tokens: 32768
kv_cache_config:
  free_gpu_memory_fraction: 0.98
3. Enable all optimizations

cuda_graph_config:
  enable_padding: true
pytorch_backend_config:
  enable_overlap_scheduler: true
kv_cache_config:
  enable_block_reuse: true
  dtype: fp8

Horizontal Scaling (Multi-Instance)

upstream trtllm_backend {
    least_conn;  # Route to instance with fewest connections
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
    server localhost:8004;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://trtllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        
        # Timeouts for long generations
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
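The least_conn policy above suits LLM serving because long generations keep connections open for very different durations, making connection counts a better balance signal than round-robin. The selection rule itself is tiny; this is a toy illustration, not a replacement for nginx:

```python
def least_conn(servers: dict[str, int]) -> str:
    """Pick the upstream with the fewest active connections,
    mirroring nginx's least_conn policy."""
    return min(servers, key=servers.get)


# Hypothetical snapshot of active connections per instance:
active = {
    "localhost:8001": 12,
    "localhost:8002": 3,
    "localhost:8003": 7,
    "localhost:8004": 9,
}
print(least_conn(active))  # -> localhost:8002
```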

Multi-Node Scaling

For models >70B parameters, use multi-node deployment:
# config.yml
cat > config.yml <<EOF
enable_attention_dp: true
pytorch_backend_config:
  enable_overlap_scheduler: true
EOF

# Slurm deployment (2 nodes, 16 GPUs total)
srun -N 2 \
  --ntasks 16 --ntasks-per-node=8 \
  --mpi=pmix --gres=gpu:8 \
  --container-image=nvcr.io/nvidia/tensorrt-llm:latest \
  bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 \
    --tp_size 16 \
    --ep_size 4 \
    --max_batch_size 161 \
    --config ./config.yml"

Security Best Practices

1. Disable trust_remote_code in production

trtllm-serve <model>  # trust_remote_code defaults to False
Only enable --trust_remote_code for models from trusted sources.
2. Use authentication/authorization

Deploy behind API gateway with auth:
location /v1 {
    auth_request /auth;
    proxy_pass http://trtllm_backend;
}

location = /auth {
    internal;
    proxy_pass http://auth-service/verify;
}
3. Rate limiting

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

location /v1 {
    limit_req zone=api_limit burst=20;
    proxy_pass http://trtllm_backend;
}
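Clients that hit this limiter receive HTTP 429, so it pairs well with client-side exponential backoff to avoid synchronized retry storms. A minimal sketch of the delay schedule (all parameters are illustrative):

```python
import random


def backoff_delays(max_retries: int = 5, base: float = 0.5, cap: float = 8.0):
    """Exponential backoff with jitter for HTTP 429 responses.

    Yields the sleep duration before each retry attempt; jitter spreads
    retries out so throttled clients don't all come back at once.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.0)


for d in backoff_delays():
    print(f"retry after {d:.2f}s")
```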
4. Input validation

Set reasonable limits:
max_seq_len: 8192      # Prevent excessive memory use
max_tokens: 2048       # Limit output length

High Availability

Health Checks

import time

import requests


def health_check(url: str, timeout: int = 5) -> bool:
    try:
        response = requests.get(f"{url}/health", timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False


# Watchdog loop; in Kubernetes, liveness probes handle this instead
while True:
    if not health_check("http://localhost:8000"):
        print("Service unhealthy, restarting...")
        # Trigger restart
    time.sleep(30)

Graceful Shutdown

import signal
import sys
from tensorrt_llm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

def shutdown_handler(signum, frame):
    print("Shutting down gracefully...")
    llm.shutdown()  # Finish in-flight requests
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown_handler)
signal.signal(signal.SIGINT, shutdown_handler)

# Start serving...

Kubernetes Deployment

trtllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtllm-serve
spec:
  replicas: 3
  selector:
    matchLabels:
      app: trtllm
  template:
    metadata:
      labels:
        app: trtllm
    spec:
      containers:
      - name: trtllm
        image: nvcr.io/nvidia/tensorrt-llm:latest
        command: ["trtllm-serve"]
        args:
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port=8000"
          - "--config=/config/config.yaml"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            nvidia.com/gpu: 1
            memory: 16Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: trtllm-config
---
apiVersion: v1
kind: Service
metadata:
  name: trtllm-service
spec:
  selector:
    app: trtllm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Troubleshooting

Out-of-Memory Errors

Symptoms: Requests fail with CUDA OOM errors

Solutions:
  1. Reduce max_batch_size or max_num_tokens
  2. Lower free_gpu_memory_fraction to 0.85
  3. Enable FP8 KV cache: kv_cache_config.dtype: fp8
  4. Disable CUDA graphs if enabled
  5. Use tensor parallelism for larger models
High Latency

Symptoms: TTFT or TPOT higher than expected

Solutions:
  1. Enable CUDA graphs: use_cuda_graph: true
  2. Enable overlap scheduler (PyTorch): enable_overlap_scheduler: true
  3. Increase max_num_tokens to allow larger batches
  4. Check GPU utilization (should be >80%)
  5. Reduce tokens_per_block to 16 for short requests
Low Throughput

Symptoms: Requests/second below expectations

Solutions:
  1. Increase max_batch_size
  2. Enable KV cache reuse: enable_block_reuse: true
  3. Use async generation: generate_async()
  4. Check queue time in metrics (should be less than 100ms)
  5. Scale horizontally with load balancer
Startup Failures

Symptoms: Server fails to start, or the model download fails

Solutions:
  1. Check HuggingFace token: huggingface-cli login
  2. Pre-download model: huggingface-cli download <model>
  3. Verify disk space (models can be 50GB+)
  4. Check model compatibility with backend
  5. Enable --trust_remote_code if using custom model code

Performance Checklist

1. Choose optimal backend

✅ PyTorch for compatibility, TensorRT for max performance
2. Configure KV cache

✅ free_gpu_memory_fraction: 0.95 ✅ enable_block_reuse: true ✅ dtype: fp8 (on Hopper GPUs)
3. Set batch limits

✅ max_batch_size: 256 (adjust for GPU memory) ✅ max_num_tokens: 16384
4. Enable optimizations

✅ CUDA graphs ✅ Overlap scheduler (PyTorch) ✅ Dynamic batching
5. Monitor metrics

✅ GPU utilization >80% ✅ KV cache hit rate >50% ✅ Queue time less than 100ms
6. Scale appropriately

✅ Tensor parallelism for large models ✅ Horizontal scaling for high traffic ✅ Load balancer for multi-instance

Next Steps

Distributed Inference

Multi-GPU and multi-node deployments

Benchmarking

Measure and optimize performance

Reference Configs

170+ optimized configurations

API Reference

Complete configuration reference
