
Overview

The trtllm-serve command starts an OpenAI-compatible server with REST endpoints, plus an optional gRPC mode. It is the simplest way to deploy TensorRT-LLM models in production, requiring minimal configuration.
trtllm-serve supports all three backends: PyTorch (default), TensorRT, and AutoDeploy.

Quick Start

Start a server with a HuggingFace model:
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
The server will start on localhost:8000 by default.
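Once the startup log shows the server is ready, you can confirm it is reachable with the health endpoint (expected to return HTTP 200 when healthy):

```shell
# Confirm the server is up before sending inference requests
curl http://localhost:8000/health
```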

Command-Line Options

Basic Configuration

trtllm-serve <model> \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --tp_size 2 \
  --max_batch_size 128 \
  --max_num_tokens 8192
model (string, required): Model path or HuggingFace model name (e.g., meta-llama/Llama-3.1-8B-Instruct)
--host (string, default "localhost"): Server hostname
--port (int, default 8000): Server port
--backend (string, default "pytorch"): Inference backend: pytorch, tensorrt, or _autodeploy

Parallelism Configuration

--tp_size (int, default 1): Tensor parallelism size (split model weights across GPUs)
--pp_size (int, default 1): Pipeline parallelism size (split layers across GPUs)
--ep_size (int): Expert parallelism size for MoE models
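These sizes compose multiplicatively: one model replica needs tp_size × pp_size GPUs, while ep_size repartitions MoE expert weights within that group rather than adding GPUs (the exact EP semantics depend on the model; treat this as a sketch of the arithmetic, not the runtime's placement logic):

```python
def required_gpus(tp_size: int = 1, pp_size: int = 1) -> int:
    """GPUs needed for one model replica: tensor ranks x pipeline stages."""
    return tp_size * pp_size

# --tp_size 2 --pp_size 2 shards each layer across 2 GPUs and splits
# the layer stack into 2 pipeline stages: 4 GPUs total.
print(required_gpus(tp_size=2, pp_size=2))  # -> 4
```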

Performance Tuning

--max_batch_size (int, default 256): Maximum number of concurrent requests
--max_num_tokens (int, default 8192): Maximum tokens across all requests in a batch
--max_seq_len (int): Maximum sequence length (prompt + generation); auto-detected from the model config if not specified
--kv_cache_free_gpu_memory_fraction (float, default 0.9): Fraction of free GPU memory to use for the KV cache
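To reason about what the KV-cache memory fraction buys you, a back-of-the-envelope capacity estimate can help. The shapes below are illustrative (roughly a Llama-3.1-8B-class model with an fp16 cache); the actual allocation also depends on block granularity and runtime overheads:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # One K and one V vector per layer, per cached token
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_capacity_tokens(gpu_mem_bytes: int, fraction: float,
                             bytes_per_token: int) -> int:
    return int(gpu_mem_bytes * fraction) // bytes_per_token

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # -> 131072 (128 KiB per cached token)

# ~0.9 of an 80 GB GPU holds roughly half a million cached tokens
print(kv_cache_capacity_tokens(80 * 1024**3, 0.9, per_token))
```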

OpenAI-Compatible Endpoints

The server exposes these OpenAI-compatible endpoints:

Chat Completions

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Where is New York?"}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'

Text Completions

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "prompt": "The capital of France is",
    "max_tokens": 64,
    "temperature": 0
  }'

Streaming Responses

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Tell me a story"}],
    max_tokens=512,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Advanced Configuration with YAML

For complex configurations, use a YAML file with --config:
config.yaml
max_batch_size: 128
max_num_tokens: 8192

kv_cache_config:
  free_gpu_memory_fraction: 0.95
  enable_block_reuse: true
  dtype: fp8

pytorch_backend_config:
  enable_overlap_scheduler: true

moe_config:
  backend: CUTLASS

cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32]
Start the server with the config:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --config config.yaml
The YAML file mirrors the structure of TorchLlmArgs. All nested configuration classes can be specified.
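Before launching, it can be worth sanity-checking that the file parses and the nesting matches what you expect. A sketch, assuming PyYAML is installed (the key names mirror the config above):

```python
import yaml  # PyYAML

config_text = """
max_batch_size: 128
max_num_tokens: 8192
kv_cache_config:
  free_gpu_memory_fraction: 0.95
  enable_block_reuse: true
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32]
"""

cfg = yaml.safe_load(config_text)
# Nested sections load as plain dicts, one per nested config class
assert cfg["kv_cache_config"]["enable_block_reuse"] is True
assert cfg["cuda_graph_config"]["batch_sizes"] == [1, 2, 4, 8, 16, 32]
print(sorted(cfg))  # top-level keys
```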

Multi-Node Deployment with Slurm

Deploy large models across multiple nodes using Slurm:
cat > config.yml <<EOF
enable_attention_dp: true
pytorch_backend_config:
  enable_overlap_scheduler: true
EOF

srun -N 2 \
  --ntasks 16 --ntasks-per-node=8 \
  --mpi=pmix --gres=gpu:8 \
  --container-image=nvcr.io/nvidia/tensorrt-llm:latest \
  bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 \
    --max_batch_size 161 \
    --max_num_tokens 1160 \
    --tp_size 16 \
    --ep_size 4 \
    --kv_cache_free_gpu_memory_fraction 0.95 \
    --config ./config.yml"
trtllm-llmapi-launch is a wrapper script that handles MPI initialization for multi-node deployments.
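A quick sanity check of the geometry above, under the usual convention of one MPI task per GPU (and assuming ep_size must divide tp_size evenly, as is typical for MoE expert sharding):

```python
nodes, gpus_per_node = 2, 8        # srun -N 2 --gres=gpu:8
ntasks = nodes * gpus_per_node     # --ntasks 16, --ntasks-per-node=8
tp_size, ep_size = 16, 4           # --tp_size 16 --ep_size 4

assert ntasks == tp_size           # one tensor-parallel rank per GPU
assert tp_size % ep_size == 0      # experts shard evenly across EP groups
print(ntasks)  # -> 16
```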

gRPC Server Mode

For high-performance use cases with external routers (e.g., sgl-router), use gRPC mode:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --port 50051
gRPC mode accepts pre-tokenized inputs and returns raw token IDs. It does not support --tool_parser, --chat_template, or disaggregated serving.

Monitoring and Metrics

Health Endpoint

curl http://localhost:8000/health

Metrics Endpoint

Enable performance metrics in your config:
enable_iter_perf_stats: true
Query runtime statistics:
curl http://localhost:8000/metrics
[
  {
    "gpuMemUsage": 76665782272,
    "iter": 154,
    "iterLatencyMS": 7.0,
    "kvCacheStats": {
      "allocNewBlocks": 3126,
      "cacheHitRate": 0.00128,
      "freeNumBlocks": 101253,
      "maxNumBlocks": 101256,
      "tokensPerBlock": 32,
      "usedNumBlocks": 3
    },
    "numActiveRequests": 1
  }
]
Metrics are stored in a queue and removed once retrieved. Poll regularly if you need to retain metrics.
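Since entries are consumed on read, a poller typically fetches the payload and derives what it needs immediately. A sketch over the sample payload above (in production you would replace the literal with an HTTP GET against /metrics):

```python
import json

# Sample /metrics response, abridged from the example above
payload = json.loads("""
[{"iter": 154,
  "numActiveRequests": 1,
  "kvCacheStats": {"usedNumBlocks": 3, "maxNumBlocks": 101256,
                   "tokensPerBlock": 32, "cacheHitRate": 0.00128}}]
""")

latest = payload[-1]
kv = latest["kvCacheStats"]
utilization = kv["usedNumBlocks"] / kv["maxNumBlocks"]
cached_tokens = kv["usedNumBlocks"] * kv["tokensPerBlock"]

print(f"iter {latest['iter']}: {cached_tokens} cached tokens, "
      f"KV utilization {utilization:.4%}")
```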

Custom Tokenizers

Use custom tokenizers for specialized models:
trtllm-serve deepseek-ai/DeepSeek-V3 \
  --custom_tokenizer deepseek_v32
Or specify a Python import path:
trtllm-serve deepseek-ai/DeepSeek-V3 \
  --custom_tokenizer tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer

Multimodal Models

For vision-language models (VLMs), disable KV cache reuse:
vlm-config.yaml
kv_cache_config:
  enable_block_reuse: false
trtllm-serve Qwen/Qwen2-VL-7B-Instruct --config vlm-config.yaml
Send multimodal requests:
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
        ]
    }]
)

print(response.choices[0].message.content)

Production Best Practices

1. Use YAML configuration files: Store configurations in version-controlled YAML files instead of long command lines.
2. Enable KV cache reuse: Set enable_block_reuse: true in kv_cache_config for improved throughput with repetitive prompts.
3. Tune batch size and token limits: Adjust max_batch_size and max_num_tokens based on your GPU memory and workload.
4. Monitor metrics: Enable enable_iter_perf_stats and poll /metrics to track GPU utilization and KV cache efficiency.
5. Use custom model names: Specify --served_model_name to expose a user-friendly model name in the API instead of the path.
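For the last point, a brief example of the flag in use (the alias is a placeholder; pick any name your clients should see):

```shell
# Serve the model under a friendly alias instead of its HF path
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --served_model_name llama-3.1-8b
```

Clients then pass "model": "llama-3.1-8b" in their requests.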

Next Steps

- LLM API: Use the Python LLM API for programmatic access
- Distributed Inference: Scale to multi-GPU and multi-node deployments
- Production Guide: Best practices for production deployments
- Disaggregated Serving: Optimize TTFT (time to first token) and TPOT (time per output token) independently
