
Overview

TensorRT-LLM supports multiple parallelism strategies for scaling inference:
  • Tensor Parallelism (TP): Split model weights across GPUs
  • Pipeline Parallelism (PP): Split layers across GPUs
  • Expert Parallelism (EP): Split experts in MoE models
  • Context Parallelism (CP): Split long sequences across GPUs
  • Disaggregated Serving: Separate prefill and decode phases

Tensor Parallelism

Tensor parallelism splits each layer's weight matrices across multiple GPUs. Best for models that don't fit on a single GPU.

Single-Node Multi-GPU

from tensorrt_llm import LLM, SamplingParams

def main():
    # Split model across 4 GPUs
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        tensor_parallel_size=4
    )

    prompts = ["Hello", "The future of AI is"]
    outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
    for output in outputs:
        print(output.outputs[0].text)

# Entry point must be protected for multi-GPU
if __name__ == '__main__':
    main()
Tensor parallelism requires GPUs on the same node with fast interconnects (NVLink/NVSwitch).

Communication Backends

TensorRT-LLM supports multiple orchestrators for multi-GPU communication. The default is MPI, which typically gives the best performance:
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    orchestrator_type=None  # MPI is default
)

Pipeline Parallelism

Split model layers vertically across GPUs. Each GPU processes a subset of layers.
from tensorrt_llm import LLM

# 4-stage pipeline
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    pipeline_parallel_size=4
)
Pipeline parallelism has bubble overhead where GPUs wait for data. Use only when tensor parallelism isn’t sufficient.
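The bubble overhead can be estimated with the standard GPipe formula, bubble = (pp − 1) / (m + pp − 1), where m is the number of microbatches in flight. A minimal sketch (the formula is the well-known GPipe estimate, not a TensorRT-LLM API):

```python
# Estimate the fraction of time pipeline stages sit idle (GPipe estimate):
# bubble = (pp - 1) / (num_microbatches + pp - 1)
def bubble_fraction(pp_size: int, num_microbatches: int) -> float:
    """Idle fraction of a pp_size-stage pipeline with a given microbatch count."""
    return (pp_size - 1) / (num_microbatches + pp_size - 1)

# With 4 stages and only 4 microbatches, roughly 43% of time is bubble:
print(f"{bubble_fraction(4, 4):.2f}")   # 0.43
# More microbatches amortize the bubble:
print(f"{bubble_fraction(4, 32):.2f}")  # 0.09
```

This is why larger batch sizes help fill pipeline bubbles, as noted in the troubleshooting section.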

Hybrid Parallelism

Combine tensor and pipeline parallelism:
# 2x4 = 8 GPUs total
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=4,    # 4 GPUs per pipeline stage
    pipeline_parallel_size=2   # 2 pipeline stages
)
Formula: world_size = tp_size × pp_size
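A quick pre-flight check for this formula can catch mismatched layouts before launch (a standalone sketch; `check_layout` is an illustrative helper, not part of the TensorRT-LLM API):

```python
# Verify that a hybrid TP x PP layout fits the available GPUs:
# world_size = tp_size * pp_size
def check_layout(tp_size: int, pp_size: int, available_gpus: int) -> int:
    world_size = tp_size * pp_size
    if world_size > available_gpus:
        raise ValueError(
            f"Layout needs {world_size} GPUs, only {available_gpus} available"
        )
    return world_size

print(check_layout(4, 2, 8))  # the 405B example above: 4 x 2 = 8
```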

Expert Parallelism (MoE Models)

For Mixture-of-Experts models like Mixtral or DeepSeek-V3:
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MoeConfig

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    moe_expert_parallel_size=4,  # Split experts across 4 GPUs
    moe_config=MoeConfig(
        backend="CUTLASS"  # Optimized MoE kernel
    )
)
Total GPUs: tp_size × ep_size. For Mixtral-8x7B: 2 × 4 = 8 GPUs.
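To build intuition for how expert parallelism distributes work, here is a sketch of Mixtral's 8 experts mapped onto ep_size = 4 ranks (round-robin assignment shown purely for illustration; the actual placement is an implementation detail of TensorRT-LLM's MoE kernels):

```python
# Illustrative round-robin mapping of experts to expert-parallel ranks.
num_experts, ep_size = 8, 4
placement = {
    rank: [e for e in range(num_experts) if e % ep_size == rank]
    for rank in range(ep_size)
}
print(placement)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Each EP rank holds 2 of the 8 experts, so per-GPU expert weights shrink by 4x.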

Context Parallelism

Split long sequences across GPUs using ring attention or Ulysses:
from tensorrt_llm import LLM
from tensorrt_llm.mapping import CpType

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    context_parallel_size=4,
    cp_config={
        "cp_type": CpType.ULYSSES  # or RING, STAR, HELIX
    }
)
  • ULYSSES: Split sequence dimension. Use for long sequences (>32K tokens).
  • RING: Ring attention. Use for very long sequences (>128K).
  • STAR: Star attention. Use for extreme lengths (>1M tokens).
  • HELIX: Helix parallelism. Use for MoE + context parallelism.
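The thresholds above can be folded into a small selection helper (an illustrative heuristic using the sequence-length cutoffs from this table; `pick_cp_type` is not a TensorRT-LLM function):

```python
# Choose a cp_type name from expected sequence length, per the table above.
def pick_cp_type(seq_len: int, is_moe: bool = False) -> str:
    if is_moe:
        return "HELIX"       # MoE + context parallelism
    if seq_len > 1_000_000:
        return "STAR"        # extreme lengths
    if seq_len > 128_000:
        return "RING"        # very long sequences
    return "ULYSSES"         # long sequences (>32K tokens)

print(pick_cp_type(64_000))  # ULYSSES
```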

Multi-Node Deployment

Prerequisites

1. Install MPI

# Install OpenMPI or use Slurm's PMI
apt-get install libopenmpi-dev openmpi-bin
2. Configure network

Ensure nodes can communicate:
# Test connectivity
mpirun -np 16 -H node1:8,node2:8 hostname
3. Set up shared storage

Models must be accessible from all nodes (NFS, S3, etc.)

Slurm Deployment

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00

# Config file
cat > config.yml <<EOF
enable_attention_dp: true
pytorch_backend_config:
  enable_overlap_scheduler: true
kv_cache_config:
  free_gpu_memory_fraction: 0.95
EOF

# Launch with trtllm-llmapi-launch wrapper
srun --mpi=pmix \
  --container-image=nvcr.io/nvidia/tensorrt-llm:latest \
  --container-mounts=/data:/data \
  bash -c "trtllm-llmapi-launch trtllm-serve \
    /data/models/deepseek-ai/DeepSeek-V3 \
    --tp_size 16 \
    --ep_size 4 \
    --max_batch_size 161 \
    --config ./config.yml"
trtllm-llmapi-launch handles MPI process spawning and GPU assignment automatically.

Manual MPI Launch

# 2 nodes, 8 GPUs each
mpirun -np 16 \
  -H node1:8,node2:8 \
  -x CUDA_VISIBLE_DEVICES \
  --bind-to none \
  python -m tensorrt_llm.commands.serve \
    meta-llama/Llama-3.1-405B-Instruct \
    --tp_size 16 \
    --backend pytorch

Disaggregated Serving

Separate prefill (context) and decode (generation) phases onto different GPU pools for independent optimization.

Why Disaggregated Serving?

Optimize TTFT

Dedicate GPUs to prefill with high parallelism for fast Time-to-First-Token

Optimize TPOT

Dedicate GPUs to decode with batching for low Time-Per-Output-Token

Prevent Interference

Prefill doesn’t delay token generation

Different GPU Types

Use H100 for prefill, L40 for decode

Architecture

[Diagram: disaggregated serving architecture]

Setup with trtllm-serve

1. Configure context servers

context-config.yml
disable_overlap_scheduler: true  # Not supported for context servers
cache_transceiver_config:
  backend: NIXL  # or UCX, MPI
  max_tokens_in_buffer: 8192
2. Start context servers

# Context server 1
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --host localhost --port 8001 \
  --backend pytorch \
  --config ./context-config.yml &

# Context server 2
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --host localhost --port 8002 \
  --backend pytorch \
  --config ./context-config.yml &
3. Configure generation server

gen-config.yml
cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 8192
4. Start generation server

CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --host localhost --port 8003 \
  --backend pytorch \
  --config ./gen-config.yml &
5. Launch orchestrator

disagg-config.yml
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 2
  urls:
    - "localhost:8001"
    - "localhost:8002"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8003"
trtllm-serve disaggregated -c disagg-config.yml

KV Cache Exchange Backends

The KV cache can be exchanged between context and generation servers over NIXL (used in the examples above), UCX, or MPI, selected via cache_transceiver_config.backend.

max_tokens_in_buffer should be ≥ the maximum input sequence length for optimal performance.
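A simple way to derive max_tokens_in_buffer from the longest input you expect is to round up to a block multiple (the rounding rule and block size here are assumptions for illustration, not a TensorRT-LLM requirement):

```python
# Round the expected maximum input length up to a block multiple
# (block size 256 is an illustrative choice).
def buffer_tokens(max_input_len: int, block: int = 256) -> int:
    return ((max_input_len + block - 1) // block) * block

print(buffer_tokens(8000))  # 8192, matching the configs above
```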

Client Usage

from openai import OpenAI

# Connect to disaggregated orchestrator
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=512
)

print(response.choices[0].message.content)
The orchestrator automatically:
  1. Routes the request to a context server (prefill)
  2. Transfers the KV cache to a generation server
  3. Generates output tokens on the generation server
  4. Returns a unified response

Performance Tuning

Overlap Scheduler (PyTorch)

Enable compute/communication overlap for multi-GPU:
pytorch_backend_config:
  enable_overlap_scheduler: true
Can improve throughput by 10-15% for TP ≥ 2.

Attention Data Parallelism

Enable for models with TP:
enable_attention_dp: true
attention_dp_config:
  enable_balance: true
  timeout_iters: 10
  batching_wait_iters: 5

NCCL Optimization

# Enable NVLink for NCCL
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=5

# Tune NCCL buffers
export NCCL_BUFFSIZE=8388608
export NCCL_P2P_LEVEL=NVL

Examples

Llama-70B on 4 GPUs

trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tp_size 4 \
  --max_batch_size 128 \
  --max_num_tokens 16384 \
  --kv_cache_free_gpu_memory_fraction 0.95

Llama-405B on 8 GPUs (Hybrid)

trtllm-serve meta-llama/Llama-3.1-405B-Instruct \
  --tp_size 4 \
  --pp_size 2 \
  --max_batch_size 64 \
  --config config.yml

DeepSeek-V3 on 16 GPUs (2 Nodes)

srun -N 2 --ntasks 16 --mpi=pmix --gres=gpu:8 \
  --container-image=nvcr.io/nvidia/tensorrt-llm:latest \
  bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 \
    --tp_size 16 --ep_size 4 --config config.yml"

Mixtral-8x7B with Expert Parallelism

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MoeConfig

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    moe_expert_parallel_size=4,
    moe_config=MoeConfig(backend="CUTLASS")
)

Troubleshooting

Error: MPI_Init failed
Solutions:
  • Ensure MPI is installed: mpirun --version
  • Use the Ray orchestrator: orchestrator_type="ray"
  • Set: export TLLM_DISABLE_MPI=1

Error: NCCL error: unhandled system error
Solutions:
  • Check the NCCL version: python -c "import torch; print(torch.cuda.nccl.version())"
  • Enable debug logging: export NCCL_DEBUG=INFO
  • Disable InfiniBand if unavailable: export NCCL_IB_DISABLE=1

Error: Failed to transfer KV cache
Solutions:
  • Increase max_tokens_in_buffer in the config
  • Try a different backend: NIXL → UCX → MPI
  • Check network connectivity between context and generation servers
  • Verify the TRTLLM_NIXL_KVCACHE_BACKEND env var

Symptoms: Low GPU utilization with PP
Solutions:
  • Prefer tensor parallelism over pipeline parallelism
  • Increase max_batch_size to fill pipeline bubbles
  • Use hybrid TP+PP only for very large models

Best Practices

1. Choose parallelism strategy

  • Single GPU: No parallelism
  • 2-8 GPUs: Tensor parallelism
  • >8 GPUs: Hybrid TP + PP
  • MoE models: Expert parallelism
  • Long sequences: Context parallelism
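The decision list above can be sketched as a function (an illustrative heuristic, not an official TensorRT-LLM recommendation engine):

```python
# Map a deployment's shape to the parallelism strategy suggested above.
def choose_strategy(num_gpus: int, is_moe: bool = False,
                    long_context: bool = False) -> str:
    if is_moe:
        return "expert parallelism"
    if long_context:
        return "context parallelism"
    if num_gpus == 1:
        return "none"
    if num_gpus <= 8:
        return "tensor parallelism"
    return "hybrid TP + PP"

print(choose_strategy(4))   # tensor parallelism
print(choose_strategy(16))  # hybrid TP + PP
```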
2. Use NVLink/NVSwitch

TP requires fast GPU-to-GPU communication. Avoid PCIe-only setups.
3. Enable overlap scheduler

pytorch_backend_config:
  enable_overlap_scheduler: true
4. Monitor communication overhead

Check iteration latency in /metrics endpoint. High latency indicates communication bottleneck.
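Latency spikes well above the mean are a common communication-bottleneck signature. A sketch of spotting them from iteration-latency samples (the payload shape and field name here are made-up examples; check your server's actual /metrics schema):

```python
import json

# Hypothetical /metrics payload with per-iteration latencies in milliseconds.
sample = json.loads('{"iter_latency_ms": [41.0, 43.5, 120.2, 42.1, 118.9]}')

lat = sample["iter_latency_ms"]
mean = sum(lat) / len(lat)
# Flag iterations that took more than 1.5x the mean.
spikes = [x for x in lat if x > 1.5 * mean]
print(f"mean={mean:.1f}ms, spikes={len(spikes)}")  # mean=73.1ms, spikes=2
```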
5. Use disaggregated serving strategically

Best for:
  • Long prompts (>4K tokens) + short outputs
  • Separate optimization of TTFT and TPOT
  • Different GPU types for prefill vs decode

Next Steps

  • Production Guide: production deployment best practices
  • Benchmarking: measure distributed performance
  • Reference Configs: 170+ optimized configurations
  • Disaggregated Examples: complete disaggregated serving examples
