Usage
Arguments
Model name, HuggingFace checkpoint path, or TensorRT engine path
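A minimal invocation passes the model as the single positional argument. As a sketch (the HuggingFace model ID below is only an illustration; any supported checkpoint or engine path works):

```shell
# Serve a HuggingFace checkpoint with all options left at their defaults
trtllm-serve meta-llama/Llama-3.1-8B-Instruct
```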
Server Options
Hostname of the server
Port of the server
Backend to use for serving the model. Choices:
pytorch, tensorrt, trt (alias for tensorrt), _autodeploy
Run gRPC server instead of OpenAI HTTP server. gRPC server accepts pre-tokenized requests and returns raw token IDs
The model name used in the API. If not specified, the model path is used. Useful when the model path is long or when you want to expose a custom name to clients
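As a sketch of the server options above, assuming the conventional `--host`, `--port`, and `--backend` flag spellings (the model ID is illustrative):

```shell
# Bind to all interfaces on a custom port with the PyTorch backend;
# flag spellings are assumptions based on the option descriptions above
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --backend pytorch
```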
Tokenizer Options
Path or name of the tokenizer. Defaults to model path if not specified
Custom tokenizer type: alias (e.g., 'deepseek_v32') or Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')
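When the tokenizer lives apart from the model weights, it can be pointed at separately. A sketch, where the `--tokenizer` flag spelling and both paths are assumptions for illustration:

```shell
# Use a tokenizer stored separately from the model checkpoint
trtllm-serve /models/my-finetune --tokenizer /models/base-tokenizer
```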
Model Configuration
Maximum number of requests that the engine can schedule
Maximum number of batched input tokens after padding is removed in each batch
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config
Maximum number of beams for beam search decoding
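The scheduling and sequence limits above might be set together like this; the flag spellings are assumptions (underscore style matching the aliases documented on this page), and the numbers are illustrative:

```shell
# Cap concurrent requests, batched tokens, and per-request length
trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --max_batch_size 64 \
  --max_num_tokens 8192 \
  --max_seq_len 4096
```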
Parallelism Options
Tensor parallelism size. Alias: --tp_size
Pipeline parallelism size. Alias: --pp_size
Context parallelism size. Alias: --cp_size
Expert parallelism size for MoE models. Alias: --ep_size
Expert cluster parallelism size. Alias: --cluster_size
Number of GPUs per node. Defaults to automatic detection
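Using the alias spellings documented above, a multi-GPU launch might look like the following sketch (model ID illustrative; 4 × 2 = 8 GPUs total):

```shell
# 4-way tensor parallelism combined with 2 pipeline stages
trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
  --tp_size 4 \
  --pp_size 2
```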
Memory Options
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers. Alias: --kv_cache_free_gpu_memory_fraction
KV cache quantization dtype for PyTorch backend. Choices:
auto, fp8, nvfp4. 'auto' uses checkpoint/model metadata; explicit values force override
Advanced Options
Path to a YAML file that overwrites the parameters specified by trtllm-serve. Can be specified as either --config or --extra_llm_api_options
Flag for HuggingFace transformers to trust remote code
The revision to use for the HuggingFace model (branch name, tag name, or commit id)
Number of workers to postprocess raw responses to comply with OpenAI protocol
Enable chunked prefill
Specify the parser for reasoning models
Specify the parser for tool models
Specify a custom chat template. Can be a file path or one-liner template string
Paths to custom module directories to import
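As a sketch, the YAML file passed via --extra_llm_api_options (or its --config alias) holds extra LLM API fields; the keys below are illustrative assumptions, not a verified schema:

```yaml
# extra-config.yml -- keys are illustrative, not an exhaustive schema
enable_chunked_prefill: true
kv_cache_config:
  free_gpu_memory_fraction: 0.9
```

It would then be referenced at launch, e.g. `trtllm-serve <model> --extra_llm_api_options extra-config.yml`.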
Disaggregated Serving
Server role. Specify this value only if running in disaggregated mode
Path to metadata server config file for disaggregated serving
URI of the disaggregated cluster
Observability
Target URL to which OpenTelemetry traces will be sent
The logging level. Choices:
verbose, info, warning, error, internal_error
Multimodal Options
Keyword arguments for media I/O (JSON string)
Path to a YAML file with extra VISUAL_GEN model options (for diffusion models)
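The media I/O keyword arguments are passed as a JSON string. A sketch, where the `--media_io_kwargs` flag spelling, the model ID, and the JSON keys are all assumptions for illustration:

```shell
# Limit video decoding to 32 frames per input clip
trtllm-serve Qwen/Qwen2-VL-7B-Instruct \
  --media_io_kwargs '{"video": {"num_frames": 32}}'
```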
Runtime Behavior
Exit with runtime error when attention window is too large to fit even a single sequence in the KV cache
Examples
Basic HTTP Server
Serve a model from HuggingFace with default settings:
Custom Port and Backend
Serve with TensorRT backend on a custom port:
Multi-GPU Tensor Parallel
Serve with 4-way tensor parallelism:
With Configuration File
Use a YAML configuration file for advanced settings:
gRPC Server
Launch a gRPC server for pre-tokenized requests:
Custom Model Name
Serve with a custom model name for API responses:
High-Performance Configuration
Optimize for throughput with increased batch size and memory allocation:
Multi-Node Serving
Serve across multiple nodes with tensor and pipeline parallelism:
Related Commands
- trtllm-bench - Benchmark model performance
- trtllm-build - Build TensorRT engines
- trtllm-eval - Evaluate model accuracy