Launch an OpenAI-compatible API server for serving TensorRT-LLM models. Supports PyTorch, TensorRT, and AutoDeploy backends with both HTTP (REST) and gRPC protocols.

Usage

trtllm-serve MODEL [OPTIONS]

Arguments

MODEL
string
required
Model name, HuggingFace checkpoint path, or TensorRT engine path

Server Options

--host
string
default:"localhost"
Hostname of the server
--port
integer
default:"8000"
Port of the server
--backend
string
default:"pytorch"
Backend to use for serving the model. Choices: pytorch, tensorrt, trt (alias for tensorrt), _autodeploy (AutoDeploy backend)
--grpc
boolean
default:"false"
Run gRPC server instead of OpenAI HTTP server. gRPC server accepts pre-tokenized requests and returns raw token IDs
--served_model_name
string
The model name used in the API. If not specified, the model path is used. Useful when the model path is long or when you want to expose a custom name to clients

Tokenizer Options

--tokenizer
string
Path or name of the tokenizer. Defaults to model path if not specified
--custom_tokenizer
string
Custom tokenizer type: alias (e.g., 'deepseek_v32') or Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')

Model Configuration

--max_batch_size
integer
default:"8"
Maximum number of requests that the engine can schedule
--max_num_tokens
integer
default:"2048"
Maximum number of batched input tokens in each batch, after padding is removed
--max_seq_len
integer
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config
--max_beam_width
integer
default:"1"
Maximum number of beams for beam search decoding

Parallelism Options

--tensor_parallel_size
integer
default:"1"
Tensor parallelism size. Alias: --tp_size
--pipeline_parallel_size
integer
default:"1"
Pipeline parallelism size. Alias: --pp_size
--context_parallel_size
integer
default:"1"
Context parallelism size. Alias: --cp_size
--moe_expert_parallel_size
integer
Expert parallelism size for MoE models. Alias: --ep_size
--moe_cluster_parallel_size
integer
Expert cluster parallelism size. Alias: --cluster_size
--gpus_per_node
integer
Number of GPUs per node. Defaults to automatic detection

Memory Options

--free_gpu_memory_fraction
float
default:"0.9"
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers. Alias: --kv_cache_free_gpu_memory_fraction
--kv_cache_dtype
string
default:"auto"
KV cache quantization dtype for PyTorch backend. Choices: auto, fp8, nvfp4. 'auto' derives the dtype from checkpoint/model metadata; an explicit value overrides it

Advanced Options

--config
string
Path to a YAML file that overrides the parameters specified by trtllm-serve. Can be specified as either --config or --extra_llm_api_options
--trust_remote_code
boolean
default:"false"
Flag for HuggingFace transformers to trust remote code
--revision
string
The revision to use for the HuggingFace model (branch name, tag name, or commit id)
--num_postprocess_workers
integer
default:"0"
Number of workers to postprocess raw responses to comply with OpenAI protocol
--enable_chunked_prefill
boolean
default:"false"
Enable chunked prefill
--reasoning_parser
string
Specify the parser for reasoning models
--tool_parser
string
Specify the parser for tool-calling models
--chat_template
string
Specify a custom chat template. Can be a file path or a one-line template string
--custom_module_dirs
string
Paths to custom module directories to import

Disaggregated Serving

--server_role
string
Server role. Specify this value only if running in disaggregated mode
--metadata_server_config_file
string
Path to metadata server config file for disaggregated serving
--disagg_cluster_uri
string
URI of the disaggregated cluster

Observability

--otlp_traces_endpoint
string
Target URL to which OpenTelemetry traces will be sent
--log_level
string
default:"info"
The logging level. Choices: verbose, info, warning, error, internal_error

Multimodal Options

--media_io_kwargs
string
Keyword arguments for media I/O (JSON string)
--extra_visual_gen_options
string
Path to a YAML file with extra VISUAL_GEN model options (for diffusion models)

Runtime Behavior

--fail_fast_on_attention_window_too_large
boolean
default:"false"
Exit with runtime error when attention window is too large to fit even a single sequence in the KV cache

Examples

Basic HTTP Server

Serve a model from HuggingFace with default settings:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct
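
Once the server is up, it exposes the standard OpenAI-compatible endpoints such as /v1/chat/completions. A minimal client sketch using only the Python standard library (the URL and model name mirror the example above; adjust both to your deployment):

```python
import json
import urllib.request

def build_request(model: str, prompt: str, max_tokens: int = 64) -> bytes:
    # OpenAI-style chat completion request body.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")

def chat(base_url: str, model: str, prompt: str) -> dict:
    # POST the request to the server's chat completions endpoint.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Requires a running server:
# chat("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
```

If --served_model_name is set, the "model" field in the request must match that name rather than the model path.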

Custom Port and Backend

Serve with TensorRT backend on a custom port:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  --port 8080

Multi-GPU Tensor Parallel

Serve with 4-way tensor parallelism:
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor_parallel_size 4

With Configuration File

Use a YAML configuration file for advanced settings:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --config server_config.yaml
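
A sketch of what such a file might contain. The keys below are illustrative assumptions, not a guaranteed schema; consult the TensorRT-LLM LLM API configuration reference for the options valid in your version:

```yaml
# server_config.yaml -- illustrative keys; verify against your TensorRT-LLM version
kv_cache_config:
  free_gpu_memory_fraction: 0.9
  enable_block_reuse: true
enable_chunked_prefill: true
```

Values set on the command line are overridden by the YAML file, so keep the file limited to options you cannot express as flags.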

gRPC Server

Launch a gRPC server for pre-tokenized requests:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --port 50051

Custom Model Name

Serve with a custom model name for API responses:
trtllm-serve /path/to/local/model \
  --served_model_name "my-custom-llama"

High-Performance Configuration

Optimize for throughput with increased batch size and memory allocation:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  --max_batch_size 256 \
  --max_num_tokens 8192 \
  --free_gpu_memory_fraction 0.95 \
  --enable_chunked_prefill

Multi-Node Serving

Serve across multiple nodes with tensor and pipeline parallelism:
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor_parallel_size 4 \
  --pipeline_parallel_size 2 \
  --gpus_per_node 8
