Overview
The trtllm-serve command starts an OpenAI-compatible server with REST endpoints and an optional gRPC mode. It is the simplest way to deploy TensorRT-LLM models in production with minimal configuration.
trtllm-serve supports all three backends: PyTorch (default), TensorRT, and AutoDeploy.
Quick Start
Start a server with a HuggingFace model:
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
The server will start on localhost:8000 by default.
Command-Line Options
Basic Configuration
trtllm-serve <model> \
--host localhost \
--port 8000 \
--backend pytorch \
--tp_size 2 \
--max_batch_size 128 \
--max_num_tokens 8192
<model>
string
required
Model path or HuggingFace model name (e.g., meta-llama/Llama-3.1-8B-Instruct)
--host
string
default: "localhost"
Server hostname
--backend
string
default: "pytorch"
Inference backend: pytorch, tensorrt, or autodeploy
Parallelism Configuration
--tp_size
Tensor parallelism size (split model across GPUs)
--pp_size
Pipeline parallelism size (split layers across GPUs)
--ep_size
Expert parallelism size for MoE models
--max_batch_size
Maximum number of concurrent requests
--max_num_tokens
Maximum tokens across all requests in a batch
--max_seq_len
Maximum sequence length (prompt + generation). Auto-detected from model config if not specified.
--free_gpu_memory_fraction
Fraction of GPU memory to use for KV cache
OpenAI-Compatible Endpoints
The server exposes these OpenAI-compatible endpoints:
Chat Completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Where is New York?"}
],
"max_tokens": 128,
"temperature": 0.7
}'
Text Completions
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama-1.1B-Chat-v1.0",
"prompt": "The capital of France is",
"max_tokens": 64,
"temperature": 0
}'
Streaming Responses
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Tell me a story"}],
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
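Under the hood, the OpenAI client consumes server-sent events: each chunk arrives as a `data: {...}` line, terminated by a `data: [DONE]` sentinel. The sketch below parses that wire format without a live server; the sample lines are illustrative and follow the OpenAI streaming response shape, not output captured from trtllm-serve.

```python
import json

def parse_sse_chunks(lines):
    """Extract delta text from OpenAI-style SSE 'data:' lines.

    Skips non-data lines and stops at the 'data: [DONE]' sentinel.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            out.append(delta["content"])
    return out

# Illustrative wire-format sample (shapes follow the OpenAI streaming spec):
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    'data: {"choices": [{"delta": {"content": " upon"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_chunks(sample)))  # Once upon
```

This is what `stream=True` saves you from writing by hand; a custom client (e.g., behind a proxy that speaks raw HTTP) would need equivalent logic.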
Advanced Configuration with YAML
For complex configurations, use a YAML file with --config:
max_batch_size: 128
max_num_tokens: 8192
kv_cache_config:
  free_gpu_memory_fraction: 0.95
  enable_block_reuse: true
  dtype: fp8
pytorch_backend_config:
  enable_overlap_scheduler: true
moe_config:
  backend: CUTLASS
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32]
Start the server with the config:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --config config.yaml
The YAML file mirrors the structure of TorchLlmArgs. All nested configuration classes can be specified.
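Because a YAML indentation slip silently changes which configuration class a key lands in, it can help to sanity-check the file before handing it to the server. A minimal sketch, assuming PyYAML is available; the key names come from the example above, and the checks are illustrative rather than an exhaustive schema validation:

```python
import yaml

config_text = """
max_batch_size: 128
max_num_tokens: 8192
kv_cache_config:
  free_gpu_memory_fraction: 0.95
  enable_block_reuse: true
  dtype: fp8
"""

config = yaml.safe_load(config_text)

# Nested keys must land under their config class, not at the top level.
assert "free_gpu_memory_fraction" in config["kv_cache_config"]
assert "free_gpu_memory_fraction" not in config
# A batch cannot contain fewer tokens than it has requests.
assert config["max_num_tokens"] >= config["max_batch_size"]
print("config looks sane")
```

In practice you would read the file with `yaml.safe_load(open("config.yaml"))` instead of an inline string.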
Multi-Node Deployment with Slurm
Deploy large models across multiple nodes using Slurm:
cat > config.yml << EOF
enable_attention_dp: true
pytorch_backend_config:
  enable_overlap_scheduler: true
EOF
srun -N 2 \
--ntasks 16 --ntasks-per-node=8 \
--mpi=pmix --gres=gpu:8 \
--container-image=nvcr.io/nvidia/tensorrt-llm:latest \
bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 \
--max_batch_size 161 \
--max_num_tokens 1160 \
--tp_size 16 \
--ep_size 4 \
--kv_cache_free_gpu_memory_fraction 0.95 \
--config ./config.yml"
trtllm-llmapi-launch is a wrapper script that handles MPI initialization for multi-node deployments.
gRPC Server Mode
For high-performance use cases with external routers (e.g., sgl-router), use gRPC mode:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
--grpc \
--port 50051
gRPC mode accepts pre-tokenized inputs and returns raw token IDs. It does not support --tool_parser, --chat_template, or disaggregated serving.
Monitoring and Metrics
Health Endpoint
curl http://localhost:8000/health
Metrics Endpoint
Enable performance metrics in your config:
enable_iter_perf_stats: true
Query runtime statistics:
curl http://localhost:8000/metrics
[
  {
    "gpuMemUsage": 76665782272,
    "iter": 154,
    "iterLatencyMS": 7.0,
    "kvCacheStats": {
      "allocNewBlocks": 3126,
      "cacheHitRate": 0.00128,
      "freeNumBlocks": 101253,
      "maxNumBlocks": 101256,
      "tokensPerBlock": 32,
      "usedNumBlocks": 3
    },
    "numActiveRequests": 1
  }
]
Metrics are stored in a queue and discarded once retrieved, so poll regularly if you need a continuous history.
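A polling loop typically reduces each /metrics response to a few derived numbers, such as KV cache utilization. The helper below uses the field names from the sample response above; the HTTP fetch itself is omitted, and the function simply takes the decoded JSON list:

```python
def summarize_metrics(stats):
    """Summarize one /metrics response (a list of per-iteration records)."""
    latest = stats[-1]
    kv = latest["kvCacheStats"]
    return {
        "iter": latest["iter"],
        "latency_ms": latest["iterLatencyMS"],
        # Fraction of KV cache blocks currently in use.
        "kv_utilization": kv["usedNumBlocks"] / kv["maxNumBlocks"],
        "kv_hit_rate": kv["cacheHitRate"],
        "active_requests": latest["numActiveRequests"],
    }

# The sample record from the response above:
sample = [{
    "gpuMemUsage": 76665782272,
    "iter": 154,
    "iterLatencyMS": 7.0,
    "kvCacheStats": {
        "allocNewBlocks": 3126,
        "cacheHitRate": 0.00128,
        "freeNumBlocks": 101253,
        "maxNumBlocks": 101256,
        "tokensPerBlock": 32,
        "usedNumBlocks": 3,
    },
    "numActiveRequests": 1,
}]
summary = summarize_metrics(sample)
print(summary["kv_utilization"])  # 3 / 101256, about 2.96e-05
```

Feeding these summaries into a time-series store (Prometheus, CloudWatch, etc.) gives you the retained history that the server's drain-on-read queue does not.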
Custom Tokenizers
Use custom tokenizers for specialized models:
trtllm-serve deepseek-ai/DeepSeek-V3 \
--custom_tokenizer deepseek_v32
Or specify a Python import path:
trtllm-serve deepseek-ai/DeepSeek-V3 \
--custom_tokenizer tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer
Multimodal Models
For vision-language models (VLMs), disable KV cache reuse:
kv_cache_config:
  enable_block_reuse: false
trtllm-serve Qwen/Qwen2-VL-7B-Instruct --config vlm-config.yaml
Send multimodal requests:
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
        ]
    }]
)
print(response.choices[0].message.content)
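The data-URL construction above is easy to factor out when you send many images. A small sketch with a hypothetical helper name (`image_to_data_url` is not part of any library here); it works on raw bytes, so a real request would pass in a file's contents:

```python
import base64

def image_to_data_url(image_bytes, mime="image/jpeg"):
    """Encode raw image bytes as a data URL for an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode()
    return f"data:{mime};base64,{b64}"

# Any bytes work for the encoding itself; here the JPEG magic bytes
# stand in for a real file read via open("image.jpg", "rb").read().
url = image_to_data_url(b"\xff\xd8\xff")
print(url)  # data:image/jpeg;base64,/9j/
```

Pass a different `mime` argument (e.g., "image/png") to match the actual file type, since the server relies on the data URL's declared MIME type.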
Production Best Practices
Use YAML configuration files
Store configurations in version-controlled YAML files instead of long command lines.
Enable KV cache reuse
Set enable_block_reuse: true in kv_cache_config for improved throughput with repetitive prompts.
Tune batch size and token limits
Adjust max_batch_size and max_num_tokens based on your GPU memory and workload.
Monitor metrics
Enable enable_iter_perf_stats and poll /metrics to track GPU utilization and KV cache efficiency.
Use custom model names
Specify --served_model_name to expose a user-friendly model name in the API instead of the path.
Next Steps
LLM API Use the Python LLM API for programmatic access
Distributed Inference Scale to multi-GPU and multi-node deployments
Production Guide Best practices for production deployments
Disaggregated Serving Optimize TTFT and TPOT independently