LLM Class
The LLM class is the primary interface for loading and running large language models with TensorRT-LLM. It supports both the TensorRT and PyTorch backends.
Constructor
Parameters
Path to the model directory or a Hugging Face model identifier. Can be:
- Local path to a model checkpoint
- Hugging Face model ID (e.g., "meta-llama/Llama-3.1-8B-Instruct")
- Path to a pre-built TensorRT engine directory
Tokenizer to use for encoding/decoding. Can be:
- Path to tokenizer directory
- Hugging Face tokenizer identifier
- Pre-loaded tokenizer instance
- If None, the tokenizer is loaded from the model directory
Tokenizer loading mode:
- 'auto': Use the fast tokenizer if available, otherwise fall back to the slow one
- 'slow': Force use of the slow Python tokenizer
Skip tokenizer initialization. Set to True if you plan to work with token IDs directly, without text.
Whether to trust remote code when loading models from the Hugging Face Hub. Required for some models with custom code.
Number of GPUs to use for tensor parallelism. Distributes model layers across multiple GPUs.
Data type for model weights. Options:
- 'auto': Automatically select based on the model configuration
- 'float16': 16-bit floating point
- 'bfloat16': 16-bit brain floating point
- 'float32': 32-bit floating point
Specific model revision to use (e.g., branch name, tag, or commit hash) when loading from Hugging Face Hub.
Specific tokenizer revision to use when loading from Hugging Face Hub.
Execution backend:
- 'tensorrt': TensorRT-optimized engine (legacy)
- 'pytorch': PyTorch backend (default for most models)
- '_autodeploy': Automatic deployment backend (beta)
Methods
generate()
Generate completions synchronously for one or more prompts.
Parameters
Input prompt(s). Can be:
- A single string: "Hello, how are you?"
- A list of token IDs: [1, 2, 3, 4]
- A dict with a "prompt" or "prompt_token_ids" key
- A list of any of the above for batch processing
Sampling parameters for generation. Can be a single SamplingParams for all prompts or a list (one per prompt).
Whether to display a progress bar during batch generation.
LoRA adapter to apply during generation.
Prompt adapter request for generation.
Configuration for retaining KV cache across requests.
Parameters for disaggregated serving (separating prefill and decode phases).
Advanced scheduling parameters.
Salt value to limit KV cache reuse to requests with the same salt.
Returns
Generation results. Returns a single RequestOutput for single-prompt input, or a list of RequestOutput for batch inputs.
generate_async()
Generate completions asynchronously for a single prompt.
Parameters
Input prompt. Can be text, token IDs, or a PreprocessedInputs object returned from preprocess().
Sampling parameters for generation.
LoRA adapter to apply during generation.
Prompt adapter request for generation.
Enable streaming mode. When True, results can be iterated as they are generated.
Configuration for retaining KV cache.
Parameters for disaggregated serving.
Advanced scheduling parameters.
Salt value to limit KV cache reuse.
Returns
A RequestOutput future that can be awaited, or iterated for streaming results.
preprocess()
Preprocess prompts into token IDs and multimodal parameters without generating. This is useful for separating preprocessing (CPU-heavy) from generation (GPU-heavy).
Parameters
Input prompt to preprocess.
Sampling parameters (needed for special token handling).
Disaggregated serving parameters.
Returns
Preprocessed inputs that can be passed to generate_async().
save()
Save the built TensorRT engine to disk (TensorRT backend only).
Parameters
Directory path where the engine should be saved.
get_stats()
Get runtime statistics from completed requests.
Parameters
Maximum time (in seconds) to wait for statistics.
Returns
List of runtime statistics dictionaries.
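A minimal sketch of polling runtime statistics after a batch completes; the timeout keyword name is an assumption based on the description above, and the model ID is illustrative:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=16))

# Fetch statistics accumulated from completed requests, waiting up to
# 2 seconds (the `timeout` keyword name is an assumption)
stats = llm.get_stats(timeout=2)
for entry in stats:
    # Each entry is a dictionary of runtime statistics
    print(entry)
```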
get_stats_async()
Get runtime statistics asynchronously as an async iterable.
Parameters
Maximum time (in seconds) to wait for statistics.
Returns
Async iterable object containing runtime statistics.
shutdown()
Shutdown the LLM instance and release resources.
Attributes
The tokenizer instance loaded by the LLM, or None if skip_tokenizer_init=True.
Temporary directory for intermediate files (TensorRT backend only).
Unique identifier for this LLM instance.
Disaggregated serving parameters for this instance.
Usage Examples
Basic Text Generation
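A minimal sketch of single-prompt generation; the model ID and sampling values are illustrative:

```python
from tensorrt_llm import LLM, SamplingParams

# Load the model (downloaded from the Hugging Face Hub if not cached locally)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Decoding configuration; these values are illustrative
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# A single string input returns a single RequestOutput
output = llm.generate("Hello, how are you?", sampling_params)
print(output.outputs[0].text)
```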
Batch Generation
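Passing a list of prompts processes them as a batch and returns one RequestOutput per prompt; a sketch, with an illustrative model ID:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

prompts = [
    "The capital of France is",
    "Water boils at",
    [1, 2, 3, 4],  # lists of token IDs are also accepted
]

# A list input returns a list of RequestOutput, one per prompt
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```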
Streaming Generation
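With streaming=True, the future returned by generate_async() can be iterated asynchronously as tokens arrive; a sketch, with an illustrative model ID:

```python
import asyncio
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(max_tokens=64)

async def stream():
    # With streaming=True the returned future is async-iterable
    async for output in llm.generate_async(
        "Tell me a story:", sampling_params, streaming=True
    ):
        # Each iteration yields the partial RequestOutput generated so far
        print(output.outputs[0].text)

asyncio.run(stream())
```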
Multi-GPU Inference
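Tensor parallelism shards the model across GPUs at load time; a sketch assuming the keyword is named tensor_parallel_size, with an illustrative model ID:

```python
from tensorrt_llm import LLM

# Shard the model across 4 GPUs with tensor parallelism
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,
)
```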
Using PyTorch Backend
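The backend can be selected explicitly at construction; a sketch assuming the keyword is named backend, matching the parameter described above:

```python
from tensorrt_llm import LLM

# Select the PyTorch backend explicitly (the default for most models)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend="pytorch",
)
```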
Context Manager
The LLM class supports the context manager protocol for automatic resource cleanup:
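A minimal sketch of context-managed usage; the model ID is illustrative:

```python
from tensorrt_llm import LLM, SamplingParams

# Resources are released automatically when the block exits,
# as if shutdown() had been called explicitly
with LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0") as llm:
    output = llm.generate("Hello!", SamplingParams(max_tokens=16))
    print(output.outputs[0].text)
```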