
LLM Class

The LLM class is the primary interface for loading and running large language models with TensorRT-LLM. It supports both TensorRT and PyTorch backends.

Constructor

from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tokenizer="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    dtype="auto"
)

Parameters

model
str | Path
required
Path to the model directory or Hugging Face model identifier. Can be:
  • Local path to model checkpoint
  • Hugging Face model ID (e.g., "meta-llama/Llama-3.1-8B-Instruct")
  • Path to pre-built TensorRT engine directory
tokenizer
str | Path | TokenizerBase | PreTrainedTokenizerBase
default:"None"
Tokenizer to use for encoding/decoding. Can be:
  • Path to tokenizer directory
  • Hugging Face tokenizer identifier
  • Pre-loaded tokenizer instance
  • If None, will attempt to load from model directory
tokenizer_mode
Literal['auto', 'slow']
default:"'auto'"
Tokenizer loading mode:
  • 'auto': Use fast tokenizer if available, otherwise slow
  • 'slow': Force use of slow Python tokenizer
skip_tokenizer_init
bool
default:"False"
Skip tokenizer initialization. Set to True if you plan to work with token IDs directly without text.
trust_remote_code
bool
default:"False"
Whether to trust remote code when loading models from Hugging Face Hub. Required for some models with custom code.
tensor_parallel_size
int
default:"1"
Number of GPUs to use for tensor parallelism. Distributes model layers across multiple GPUs.
dtype
str
default:"'auto'"
Data type for model weights. Options:
  • 'auto': Automatically select based on model configuration
  • 'float16': 16-bit floating point
  • 'bfloat16': Brain floating point 16
  • 'float32': 32-bit floating point
revision
str
default:"None"
Specific model revision to use (e.g., branch name, tag, or commit hash) when loading from Hugging Face Hub.
tokenizer_revision
str
default:"None"
Specific tokenizer revision to use when loading from Hugging Face Hub.
backend
Literal['tensorrt', 'pytorch', '_autodeploy']
default:"None"
Execution backend:
  • 'tensorrt': TensorRT-optimized engine (legacy)
  • 'pytorch': PyTorch backend (default for most models)
  • '_autodeploy': Automatic deployment backend (beta)

Methods

generate()

Generate completions synchronously for one or more prompts.
from tensorrt_llm.sampling_params import SamplingParams

outputs = llm.generate(
    inputs="Tell me about artificial intelligence",
    sampling_params=SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
)

Parameters

inputs
PromptInputs | Sequence[PromptInputs]
required
Input prompt(s). Can be:
  • Single string: "Hello, how are you?"
  • List of token IDs: [1, 2, 3, 4]
  • Dict with "prompt" or "prompt_token_ids" keys
  • List of any of the above for batch processing
sampling_params
SamplingParams | List[SamplingParams]
default:"None"
Sampling parameters for generation. Can be a single SamplingParams for all prompts or a list (one per prompt).
use_tqdm
bool
default:"True"
Whether to display a progress bar during batch generation.
lora_request
LoRARequest | Sequence[LoRARequest]
default:"None"
LoRA adapter to apply during generation.
prompt_adapter_request
PromptAdapterRequest | Sequence[PromptAdapterRequest]
default:"None"
Prompt adapter request for generation.
kv_cache_retention_config
KvCacheRetentionConfig | Sequence[KvCacheRetentionConfig]
default:"None"
Configuration for retaining KV cache across requests.
disaggregated_params
DisaggregatedParams | Sequence[DisaggregatedParams]
default:"None"
Parameters for disaggregated serving (separating prefill and decode phases).
scheduling_params
SchedulingParams | List[SchedulingParams]
default:"None"
Advanced scheduling parameters.
cache_salt
str | Sequence[str]
default:"None"
Salt value to limit KV cache reuse to requests with the same salt.

Returns

outputs
RequestOutput | List[RequestOutput]
Generation results. Returns a single RequestOutput for a single prompt, or a list of RequestOutput objects for batch inputs.
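The input forms accepted by the inputs parameter can be sketched as plain Python values. The example below is illustrative only; the final generate() call is commented out because it requires a constructed LLM instance:

```python
# The four accepted input forms for generate(), built as plain Python.

# 1. A single prompt string
single_prompt = "Hello, how are you?"

# 2. A list of token IDs
token_ids = [1, 2, 3, 4]

# 3. A dict with "prompt" or "prompt_token_ids" keys
prompt_dict = {"prompt": "Hello, how are you?"}
token_dict = {"prompt_token_ids": [1, 2, 3, 4]}

# 4. A batch: a list mixing any of the above
batch = [single_prompt, token_ids, prompt_dict, token_dict]

# outputs = llm.generate(batch)  # would return List[RequestOutput]
```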

generate_async()

Generate completions asynchronously for a single prompt.
future = llm.generate_async(
    inputs="Tell me about artificial intelligence",
    sampling_params=SamplingParams(temperature=0.7, max_tokens=100),
    streaming=True
)

# Wait for completion
output = future.result()

# Or stream results
for partial_output in future:
    print(partial_output.outputs[0].text_diff, end="", flush=True)

Parameters

inputs
PromptInputs | PreprocessedInputs
required
Input prompt. Can be text, token IDs, or a PreprocessedInputs object returned from preprocess().
sampling_params
SamplingParams
default:"None"
Sampling parameters for generation.
lora_request
LoRARequest
default:"None"
LoRA adapter to apply during generation.
prompt_adapter_request
PromptAdapterRequest
default:"None"
Prompt adapter request for generation.
streaming
bool
default:"False"
Enable streaming mode. When True, results can be iterated as they are generated.
kv_cache_retention_config
KvCacheRetentionConfig
default:"None"
Configuration for retaining KV cache.
disaggregated_params
DisaggregatedParams
default:"None"
Parameters for disaggregated serving.
scheduling_params
SchedulingParams
default:"None"
Advanced scheduling parameters.
cache_salt
str
default:"None"
Salt value to limit KV cache reuse.

Returns

output
RequestOutput
A RequestOutput future that can be awaited or iterated for streaming results.

preprocess()

Preprocess prompts into token IDs and multimodal parameters without generating. This is useful for separating preprocessing (CPU-heavy) from generation (GPU-heavy).
preprocessed = llm.preprocess(
    inputs="Tell me about AI",
    sampling_params=SamplingParams(max_tokens=100)
)

# Later, use preprocessed input
output = llm.generate_async(inputs=preprocessed)

Parameters

inputs
PromptInputs
required
Input prompt to preprocess.
sampling_params
SamplingParams
default:"None"
Sampling parameters (needed for special token handling).
disaggregated_params
DisaggregatedParams
default:"None"
Disaggregated serving parameters.

Returns

preprocessed
PreprocessedInputs
Preprocessed inputs that can be passed to generate_async().

save()

Save the built TensorRT engine to disk (TensorRT backend only).
llm.save("./saved_engine")

Parameters

engine_dir
str
required
Directory path where the engine should be saved.

get_stats()

Get runtime statistics from completed requests.
stats = llm.get_stats(timeout=2.0)
for stat in stats:
    print(f"Iteration {stat['iter']}: GPU memory usage {stat['gpuMemUsage']}")

Parameters

timeout
float
default:"2.0"
Maximum time (in seconds) to wait for statistics.

Returns

stats
List[dict]
List of runtime statistics dictionaries.
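As a sketch of consuming the returned List[dict], the snippet below summarizes stats using the 'iter' and 'gpuMemUsage' keys shown in the example above. The sample entries are fabricated for illustration; the actual keys and units present are backend-dependent:

```python
# Summarize a get_stats() result (List[dict], as documented above).
# The sample entries below are illustrative, not real output.
sample_stats = [
    {"iter": 10, "gpuMemUsage": 8_000_000_000},
    {"iter": 11, "gpuMemUsage": 8_100_000_000},
]

def peak_gpu_mem(stats):
    """Return the highest reported GPU memory usage across iterations."""
    return max(s.get("gpuMemUsage", 0) for s in stats)

def iter_range(stats):
    """Return (first, last) iteration indices covered by the stats."""
    iters = [s["iter"] for s in stats]
    return min(iters), max(iters)

print(peak_gpu_mem(sample_stats))  # 8100000000
print(iter_range(sample_stats))    # (10, 11)
```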

get_stats_async()

Get runtime statistics asynchronously as an async iterable.
async for stat in llm.get_stats_async(timeout=2.0):
    print(f"GPU memory usage: {stat['gpuMemUsage']}")

Parameters

timeout
float
default:"2.0"
Maximum time (in seconds) to wait for statistics.

Returns

stats
IterationResult
Async iterable object containing runtime statistics.

shutdown()

Shutdown the LLM instance and release resources.
llm.shutdown()

Attributes

tokenizer
TokenizerBase | None
The tokenizer instance loaded by the LLM, or None if skip_tokenizer_init=True.
workspace
Path | None
Temporary directory for intermediate files (TensorRT backend only).
llm_id
str
Unique identifier for this LLM instance.
disaggregated_params
dict
Disaggregated serving parameters for this instance.

Usage Examples

Basic Text Generation

from tensorrt_llm import LLM
from tensorrt_llm.sampling_params import SamplingParams

# Initialize LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Generate with custom parameters
output = llm.generate(
    "What is machine learning?",
    sampling_params=SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=200
    )
)

print(output.outputs[0].text)

Batch Generation

prompts = [
    "Tell me about Python",
    "What is TensorRT?",
    "Explain neural networks"
]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=100)
)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")

Streaming Generation

future = llm.generate_async(
    "Write a short story",
    sampling_params=SamplingParams(max_tokens=500),
    streaming=True
)

# Stream text as it's generated
for partial in future:
    new_text = partial.outputs[0].text_diff
    print(new_text, end="", flush=True)

Multi-GPU Inference

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # Use 4 GPUs
    dtype="bfloat16"
)

output = llm.generate("What is the meaning of life?")
print(output.outputs[0].text)

Using PyTorch Backend

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend="pytorch",
    dtype="float16"
)

output = llm.generate("Hello, world!")
print(output.outputs[0].text)

Context Manager

The LLM class supports the context manager protocol for automatic resource cleanup:
with LLM(model="meta-llama/Llama-3.1-8B-Instruct") as llm:
    output = llm.generate("What is AI?")
    print(output.outputs[0].text)
# Resources automatically released
