Launch an OpenAI-compatible API server for serving TensorRT-LLM models. Supports PyTorch, TensorRT, and AutoDeploy backends with both HTTP (REST) and gRPC protocols.

Usage

trtllm-serve MODEL [OPTIONS]

Arguments

MODEL
string
required
Model name, HuggingFace checkpoint path, or TensorRT engine path

Server Options

--host
string
default:"localhost"
Hostname of the server
--port
integer
default:"8000"
Port of the server
--backend
string
default:"pytorch"
Backend to use for serving the model. Choices: pytorch, tensorrt, trt (alias for tensorrt), _autodeploy (AutoDeploy backend)
--grpc
boolean
default:"false"
Run gRPC server instead of OpenAI HTTP server. gRPC server accepts pre-tokenized requests and returns raw token IDs
--served_model_name
string
The model name used in the API. If not specified, the model path is used. Useful when the model path is long or when you want to expose a custom name to clients

Tokenizer Options

--tokenizer
string
Path or name of the tokenizer. Defaults to model path if not specified
--custom_tokenizer
string
Custom tokenizer type: alias (e.g., 'deepseek_v32') or Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')

Model Configuration

--max_batch_size
integer
default:"8"
Maximum number of requests that the engine can schedule
--max_num_tokens
integer
default:"2048"
Maximum number of batched input tokens in each batch, after padding is removed
--max_seq_len
integer
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config
--max_beam_width
integer
default:"1"
Maximum number of beams for beam search decoding

Parallelism Options

--tensor_parallel_size
integer
default:"1"
Tensor parallelism size. Alias: --tp_size
--pipeline_parallel_size
integer
default:"1"
Pipeline parallelism size. Alias: --pp_size
--context_parallel_size
integer
default:"1"
Context parallelism size. Alias: --cp_size
--moe_expert_parallel_size
integer
Expert parallelism size for MoE models. Alias: --ep_size
--moe_cluster_parallel_size
integer
Expert cluster parallelism size. Alias: --cluster_size
--gpus_per_node
integer
Number of GPUs per node. Defaults to automatic detection

Memory Options

--free_gpu_memory_fraction
float
default:"0.9"
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers. Alias: --kv_cache_free_gpu_memory_fraction
--kv_cache_dtype
string
default:"auto"
KV cache quantization dtype for PyTorch backend. Choices: auto, fp8, nvfp4. 'auto' derives the dtype from checkpoint/model metadata; an explicit value overrides it

Advanced Options

--config
string
Path to a YAML file that overrides the parameters specified by trtllm-serve. Can be specified as either --config or --extra_llm_api_options
--trust_remote_code
boolean
default:"false"
Flag for HuggingFace transformers to trust remote code
--revision
string
The revision to use for the HuggingFace model (branch name, tag name, or commit id)
--num_postprocess_workers
integer
default:"0"
Number of workers to postprocess raw responses to comply with OpenAI protocol
--enable_chunked_prefill
boolean
default:"false"
Enable chunked prefill
--reasoning_parser
string
Specify the parser for reasoning models
--tool_parser
string
Specify the parser for tool-calling models
--chat_template
string
Specify a custom chat template. Can be a file path or a one-line template string
--custom_module_dirs
string
Paths to custom module directories to import

Disaggregated Serving

--server_role
string
Server role. Specify this value only if running in disaggregated mode
--metadata_server_config_file
string
Path to metadata server config file for disaggregated serving
--disagg_cluster_uri
string
URI of the disaggregated cluster

Observability

--otlp_traces_endpoint
string
Target URL to which OpenTelemetry traces will be sent
--log_level
string
default:"info"
The logging level. Choices: verbose, info, warning, error, internal_error

Multimodal Options

--media_io_kwargs
string
Keyword arguments for media I/O (JSON string)
--extra_visual_gen_options
string
Path to a YAML file with extra VISUAL_GEN model options (for diffusion models)

Runtime Behavior

--fail_fast_on_attention_window_too_large
boolean
default:"false"
Exit with runtime error when attention window is too large to fit even a single sequence in the KV cache

Examples

Basic HTTP Server

Serve a model from HuggingFace with default settings:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct
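
Once the server is up, it exposes the standard OpenAI-compatible endpoints such as /v1/chat/completions. A minimal client sketch using only the Python standard library (the URL and model name mirror the example above; adjust both to your deployment):

```python
import json
import urllib.request

def build_request(model: str, prompt: str, max_tokens: int = 64) -> bytes:
    # OpenAI-style chat completion request body.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")

def chat(base_url: str, model: str, prompt: str) -> dict:
    # POST the request to the server's chat completions endpoint.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Requires a running server:
# chat("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
```

If --served_model_name is set, the "model" field in the request must match that name rather than the model path.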

Custom Port and Backend

Serve with TensorRT backend on a custom port:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  --port 8080

Multi-GPU Tensor Parallel

Serve with 4-way tensor parallelism:
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor_parallel_size 4

With Configuration File

Use a YAML configuration file for advanced settings:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --config server_config.yaml
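
A sketch of what such a file might contain. The keys below are illustrative assumptions, not a guaranteed schema; consult the TensorRT-LLM LLM API configuration reference for the options valid in your version:

```yaml
# server_config.yaml -- illustrative keys; verify against your TensorRT-LLM version
kv_cache_config:
  free_gpu_memory_fraction: 0.9
  enable_block_reuse: true
enable_chunked_prefill: true
```

Values set on the command line are overridden by the YAML file, so keep the file limited to options you cannot express as flags.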

gRPC Server

Launch a gRPC server for pre-tokenized requests:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --port 50051

Custom Model Name

Serve with a custom model name for API responses:
trtllm-serve /path/to/local/model \
  --served_model_name "my-custom-llama"

High-Performance Configuration

Optimize for throughput with increased batch size and memory allocation:
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  --max_batch_size 256 \
  --max_num_tokens 8192 \
  --free_gpu_memory_fraction 0.95 \
  --enable_chunked_prefill

Multi-Node Serving

Serve across multiple nodes with tensor and pipeline parallelism:
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor_parallel_size 4 \
  --pipeline_parallel_size 2 \
  --gpus_per_node 8
