
RequestOutput

The RequestOutput class represents the output of a completion request to the LLM. It contains the generated text, tokens, metadata, and optional performance metrics.

Class Definition

from tensorrt_llm.llmapi import RequestOutput

# RequestOutput is returned by LLM.generate() and LLM.generate_async()
output: RequestOutput = llm.generate("What is AI?")

Note: RequestOutput instances are created internally by the executor; users should not instantiate this class directly.

Attributes

request_id
int
Unique identifier for this generation request.
print(f"Request ID: {output.request_id}")
prompt
str | None
The original prompt string that was provided to the LLM. May be None if the request was submitted as token IDs only.
if output.prompt:
    print(f"Original prompt: {output.prompt}")
prompt_token_ids
List[int]
Token IDs of the input prompt after tokenization.
print(f"Prompt length: {len(output.prompt_token_ids)} tokens")
outputs
List[CompletionOutput]
List of completion outputs. Length equals sampling_params.n (the number of sequences requested). Each CompletionOutput contains:
  • Generated text
  • Token IDs
  • Log probabilities (if requested)
  • Finish reason
# Single output
text = output.outputs[0].text

# Multiple outputs (when n > 1)
for i, completion in enumerate(output.outputs):
    print(f"Output {i}: {completion.text}")
context_logits
torch.Tensor | None
Logits tensor for prompt tokens. Shape: [prompt_length, vocab_size]. Only available when sampling_params.return_context_logits=True.
if output.context_logits is not None:
    print(f"Context logits shape: {output.context_logits.shape}")
disaggregated_params
DisaggregatedParams | None
Parameters for disaggregated serving, including:
  • Request type (context_only, generation_only)
  • Multimodal embedding handles
  • Context request ID
  • First generated tokens
Only populated in disaggregated serving mode.
finished
bool
Whether the entire request has finished processing.
if output.finished:
    print("Generation complete")

Methods

result()

Wait for the generation to complete and return the result (for async requests).
future = llm.generate_async("Tell me about AI")
output = future.result(timeout=30.0)
print(output.outputs[0].text)

Parameters

timeout
float
default: None
Maximum time to wait in seconds. None means wait indefinitely.

Returns

output
RequestOutput
The completed request output.

Streaming (Iteration)

For streaming requests, RequestOutput can be iterated to get partial results:
future = llm.generate_async("Write a story", streaming=True)

# Synchronous iteration
for partial_output in future:
    new_text = partial_output.outputs[0].text_diff
    print(new_text, end="", flush=True)

# Async iteration (must run inside an async function)
async def consume():
    async for partial_output in future:
        new_text = partial_output.outputs[0].text_diff
        print(new_text, end="", flush=True)

CompletionOutput

Each RequestOutput.outputs entry is a CompletionOutput object containing the generated text and metadata.

Attributes

index
int
Index of this output in the request (0 to n-1).
text
str
The generated text output.
generated_text = output.outputs[0].text
token_ids
List[int]
Token IDs of the generated output.
tokens = output.outputs[0].token_ids
print(f"Generated {len(tokens)} tokens")
cumulative_logprob
float | None
Sum of log probabilities for all generated tokens. Used for ranking multiple outputs.
import math

logprob = output.outputs[0].cumulative_logprob
perplexity = math.exp(-logprob / len(output.outputs[0].token_ids))
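Since cumulative_logprob is a sum over all generated tokens, raw values penalize longer completions. A common fix when ranking multiple outputs is to compare the mean log probability per token instead. A minimal standalone sketch of that idea (the helper name, tuple layout, and sample values are illustrative, not part of the API):

```python
def rank_completions(completions):
    """Rank (text, cumulative_logprob, num_tokens) tuples by
    mean logprob per token, best (least negative) first."""
    return sorted(
        completions,
        key=lambda c: c[1] / c[2],  # length-normalized logprob
        reverse=True,
    )

# Illustrative values, not real model output
candidates = [
    ("Answer A", -12.0, 10),  # mean logprob -1.2
    ("Answer B", -5.0, 4),    # mean logprob -1.25
    ("Answer C", -9.0, 9),    # mean logprob -1.0
]
best = rank_completions(candidates)[0]
print(best[0])  # → Answer C
```

In real code, the tuples would be built from completion.text, completion.cumulative_logprob, and len(completion.token_ids) for each entry in output.outputs.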
logprobs
List[Dict[int, Logprob]] | None
Log probabilities for generated tokens (if sampling_params.logprobs was set). Each entry is a dictionary mapping token IDs to Logprob objects with:
  • logprob: Log probability value
  • rank: Rank among all tokens (1 = most likely)
if output.outputs[0].logprobs:
    for token_logprobs in output.outputs[0].logprobs:
        for token_id, logprob_info in token_logprobs.items():
            print(f"Token {token_id}: logprob={logprob_info.logprob}, rank={logprob_info.rank}")
prompt_logprobs
List[Dict[int, Logprob]] | None
Log probabilities for prompt tokens (if sampling_params.prompt_logprobs was set). Same format as logprobs.
finish_reason
Literal['stop', 'length', 'timeout', 'cancelled'] | None
Reason why generation stopped:
  • 'stop': Hit stop string, stop token, or EOS token
  • 'length': Reached max_tokens limit
  • 'timeout': Request timed out
  • 'cancelled': Request was cancelled
  • None: Still generating (streaming mode)
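A common pattern is branching on finish_reason to decide how to handle a completion. A hedged sketch, where the retry policy is illustrative and not part of the API:

```python
def should_retry_with_more_tokens(finish_reason):
    """Return True when generation was truncated by the max_tokens
    budget and could be retried with a larger limit."""
    if finish_reason == "length":
        return True  # hit max_tokens; likely cut off mid-thought
    # 'stop' is a clean ending; 'timeout' and 'cancelled' usually
    # need different handling; None means still streaming.
    return False

print(should_retry_with_more_tokens("length"))  # → True
print(should_retry_with_more_tokens("stop"))    # → False
```

In practice this would be called as should_retry_with_more_tokens(output.outputs[0].finish_reason) after a request completes.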
stop_reason
int | str | None
The specific stop string or token ID that caused generation to stop. Only set when finish_reason='stop'.
if output.outputs[0].finish_reason == 'stop':
    print(f"Stopped by: {output.outputs[0].stop_reason}")
generation_logits
torch.Tensor | None
Full logits tensor for generated tokens. Shape: [num_tokens, vocab_size]. Only available when sampling_params.return_generation_logits=True.
additional_context_outputs
Dict[str, torch.Tensor] | None
Additional model outputs from the context (prompt) phase.
additional_generation_outputs
Dict[str, torch.Tensor] | None
Additional model outputs from the generation phase.
disaggregated_params
DisaggregatedParams | None
Disaggregated serving parameters for this output.
request_perf_metrics
RequestPerfMetrics | None
Performance metrics for this request (if sampling_params.return_perf_metrics=True):
  • Time to first token (TTFT)
  • Time per output token (TPOT)
  • End-to-end latency (E2E)
  • Queue time
  • Throughput
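These metrics are related arithmetically: if TPOT is the average time per token after the first, then end-to-end latency decomposes as TTFT plus TPOT times the remaining token count. A standalone sketch under that assumption (the function and sample numbers are illustrative, not part of the API):

```python
def tpot_from_latency(e2e_s, ttft_s, num_tokens):
    """Derive average time per output token, assuming
    e2e = ttft + tpot * (num_tokens - 1)."""
    return (e2e_s - ttft_s) / (num_tokens - 1)

# 2.1 s end-to-end, first token at 0.3 s, 10 tokens generated
print(round(tpot_from_latency(2.1, 0.3, 10), 3))  # → 0.2
```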

Streaming Properties

length
int
Number of tokens generated so far.
print(f"Generated {output.outputs[0].length} tokens")
text_diff
str
Newly generated text since the last update (streaming mode).
for partial in future:
    new_text = partial.outputs[0].text_diff
    print(new_text, end="", flush=True)
token_ids_diff
List[int]
Newly generated token IDs since the last update (streaming mode).
logprobs_diff
List[Dict[int, Logprob]]
Log probabilities for newly generated tokens (streaming mode).
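The *_diff properties are incremental: concatenating every token_ids_diff across streaming updates should reproduce the final token_ids. A standalone sketch of that invariant, using plain lists in place of real partial outputs (the token IDs are made up for illustration):

```python
# Simulated token_ids_diff values from three streaming updates
diffs = [[101, 2023], [2003, 1037], [2742, 102]]

accumulated = []
for diff in diffs:
    # mirrors reading partial_output.outputs[0].token_ids_diff per update
    accumulated.extend(diff)

# accumulated now matches what token_ids would hold at the end
print(accumulated)  # → [101, 2023, 2003, 1037, 2742, 102]
```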

Usage Examples

Basic Output Access

output = llm.generate("What is machine learning?")

print(f"Request ID: {output.request_id}")
print(f"Prompt: {output.prompt}")
print(f"Generated text: {output.outputs[0].text}")
print(f"Finish reason: {output.outputs[0].finish_reason}")

Multiple Outputs

from tensorrt_llm.sampling_params import SamplingParams

params = SamplingParams(n=3, temperature=0.8, max_tokens=100)
output = llm.generate("Write a haiku", sampling_params=params)

for i, completion in enumerate(output.outputs):
    print(f"\n=== Output {i+1} ===")
    print(completion.text)
    print(f"Cumulative log prob: {completion.cumulative_logprob}")

Log Probabilities

params = SamplingParams(logprobs=5, max_tokens=50)
output = llm.generate("The capital of France is", sampling_params=params)

for i, token_logprobs in enumerate(output.outputs[0].logprobs):
    print(f"\nToken position {i}:")
    for token_id, logprob_info in token_logprobs.items():
        token_text = llm.tokenizer.decode([token_id])
        print(f"  {token_text!r}: logprob={logprob_info.logprob:.3f}, rank={logprob_info.rank}")

Streaming with text_diff

future = llm.generate_async(
    "Write a short story about a robot",
    sampling_params=SamplingParams(max_tokens=500),
    streaming=True
)

for partial_output in future:
    # Print only new text since last iteration
    new_text = partial_output.outputs[0].text_diff
    print(new_text, end="", flush=True)
    
print("\n\nGeneration complete!")
print(f"Final finish reason: {partial_output.outputs[0].finish_reason}")

Performance Metrics

params = SamplingParams(return_perf_metrics=True, max_tokens=256)
output = llm.generate("Tell me about quantum computing", sampling_params=params)

metrics = output.outputs[0].request_perf_metrics
if metrics:
    print(f"Time to first token: {metrics.ttft:.3f}s")
    print(f"Time per output token: {metrics.tpot:.3f}s")
    print(f"End-to-end latency: {metrics.e2e:.3f}s")
    print(f"Throughput: {metrics.throughput:.1f} tokens/s")

Async/Await

import asyncio

async def generate_text():
    future = llm.generate_async("Explain neural networks")
    output = await future
    return output.outputs[0].text

text = asyncio.run(generate_text())
print(text)
