generate()
Generate text from a loaded model given a formatted prompt, optional images, and optional audio. Returns the complete generation result after all tokens are produced.
Signature
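The signature code block appears to have been lost in this export. A plausible shape, with parameter names and defaults assumed from the parameter descriptions below (not confirmed by this page):

```python
# Sketch only: parameter names and defaults are assumptions inferred from
# the parameter descriptions; check the package source for the real signature.
def generate(
    model,
    processor,
    prompt: str,
    image=None,               # path(s), URL(s), or PIL.Image.Image object(s)
    audio=None,               # audio file path(s) or URL(s)
    verbose: bool = False,
    max_tokens: int = 256,
    temperature: float = 0.0,
    top_p: float = 1.0,
    top_k: int = 0,
    **kwargs,                 # repetition, KV-cache, and thinking-mode options
):
    ...
```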
Parameters
- The loaded model returned by load().
- The processor returned by load().
- The formatted prompt string. Produce this with apply_chat_template().
- Image path(s), URL(s), or PIL.Image.Image object(s). Pass a list for multi-image inputs.
- Audio file path(s) or URL(s).
- When True, prints the prompt, generated tokens, and timing statistics to stdout.
- Maximum number of tokens to generate.
- Sampling temperature. Use 0.0 for greedy (deterministic) decoding.
- Nucleus sampling probability mass. Only tokens within the top-p cumulative probability are considered.
- Restrict sampling to the top-k tokens. 0 disables top-k filtering.
- Minimum probability threshold relative to the highest-probability token.
- Penalty factor applied to tokens that have already appeared. Values above 1.0 discourage repetition.
- Number of recent tokens to consider when applying the repetition penalty.
- Resize input images to (height, width) before processing.
- Quantize the KV cache to this many bits. Reduces memory for long contexts.
- Group size for KV cache quantization.
- Maximum number of tokens to keep in the KV cache. Older entries are evicted.
- Number of tokens to process per prefill step. Lower values reduce peak memory usage during long-context prefill.
- Additional end-of-sequence token strings to stop generation.
- Exclude special tokens from the decoded output text.
- Enable thinking mode in the chat template (e.g. for Qwen3.5 reasoning models).
- Maximum number of tokens allowed inside a thinking block. When exceeded, the model is forced to emit the thinking-end token.
Returns
A dataclass with the complete generation output. See GenerationResult below.
Example
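A minimal usage sketch, assuming an mlx-vlm-style package layout; the import paths, model name, and image filename are illustrative assumptions, not taken from this page:

```python
# Illustrative sketch — package name, model repo, and file paths are assumptions.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # assumed example model
model, processor = load(model_path)
config = load_config(model_path)

# Build a formatted prompt for one image.
prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)

result = generate(
    model,
    processor,
    prompt,
    image=["example.jpg"],   # path, URL, or PIL.Image.Image
    max_tokens=256,
    temperature=0.0,         # greedy decoding
)
print(result.text)
```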
stream_generate()
A generator that yields GenerationResult objects one token at a time. Use this for streaming output to a UI or terminal.
Signature
Parameters
stream_generate accepts the same parameters as generate() (minus verbose), passed as **kwargs. See the generate() parameters section above.
Yields
One GenerationResult per generated token. The text field contains the decoded text segment for that token. Cumulative statistics (prompt_tokens, generation_tokens, prompt_tps, etc.) are updated on every yield.
Example
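A streaming sketch under the same assumptions as the generate() example above (import paths, field names such as generation_tps, and file names are illustrative, not confirmed by this page):

```python
# Illustrative sketch — package name, field names, and paths are assumptions.
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)

# Each yield carries the newly decoded text segment plus cumulative statistics.
for chunk in stream_generate(model, processor, prompt, image=["example.jpg"], max_tokens=256):
    print(chunk.text, end="", flush=True)
print()
print(f"{chunk.generation_tokens} tokens @ {chunk.generation_tps:.1f} tok/s")
```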
batch_generate()
Generate responses for multiple prompts in a single batch. Groups same-sized images together to avoid spatial padding within each group, which improves accuracy and throughput.
Signature
Parameters
- The loaded model returned by load().
- The processor returned by load().
- Image paths, URLs, or PIL.Image.Image objects. One image per prompt.
- Audio file paths or URLs. Batched audio is not yet fully supported.
- List of formatted prompt strings (one per sample). Produce each with apply_chat_template().
- Maximum output tokens. Pass a single integer to apply the same limit to all prompts, or a list to set per-prompt limits.
- Print batch statistics to stdout after generation.
- Group images with the same spatial dimensions into sub-batches. This eliminates padding within each group and improves accuracy. Disable only if all images are already the same size.
- Record the original (height, width) of each image in the BatchResponse.image_sizes field.
Returns
A dataclass containing generated texts and aggregate statistics. See BatchResponse below.
Example
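A sketch of batched use; the import paths, the texts and stats field names, and the file names follow the descriptions on this page but are assumptions where the page does not spell them out:

```python
# Illustrative sketch — import paths, field names, and file names are assumptions.
from mlx_vlm import load, batch_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

questions = ["Describe this image.", "What color is the sky?"]
prompts = [apply_chat_template(processor, config, q, num_images=1) for q in questions]

response = batch_generate(
    model,
    processor,
    prompts,
    image=["a.jpg", "b.jpg"],   # one image per prompt
    max_tokens=[128, 256],      # per-prompt limits
)
for text in response.texts:
    print(text)
print(f"batch generation speed: {response.stats.generation_tps:.1f} tok/s")
```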
Data classes
GenerationResult
Returned by generate() and yielded by stream_generate().
- Decoded text for this generation step. In stream_generate, this is the newly decoded segment for the current token. In generate, this is the full generated response.
- The last generated token ID.
- Log-probabilities over the vocabulary for the last generated token.
- Number of tokens in the input prompt.
- Number of tokens generated so far.
- Total tokens processed (prompt_tokens + generation_tokens).
- Prompt processing speed in tokens per second.
- Generation speed in tokens per second.
- Peak memory usage in gigabytes.
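Taken together, the fields above suggest a shape like the following sketch. Only text, prompt_tokens, generation_tokens, and prompt_tps are named explicitly on this page; the other field names are assumptions:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Sketch of the result shape; field names not stated on this page are assumptions.
@dataclass
class GenerationResult:
    text: str                  # decoded segment (stream) or full response (generate)
    token: Optional[int]       # last generated token ID
    logprobs: Any              # log-probabilities for the last token
    prompt_tokens: int         # tokens in the input prompt
    generation_tokens: int     # tokens generated so far
    total_tokens: int          # prompt_tokens + generation_tokens
    prompt_tps: float          # prompt processing speed, tokens/sec
    generation_tps: float      # generation speed, tokens/sec
    peak_memory: float         # peak memory, GB
```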
BatchResponse
Returned by batch_generate().
- Generated text for each prompt, in the same order as the input prompts list.
- Aggregate performance statistics for the batch. See BatchStats below.
- Original (height, width) for each input image. Present when track_image_sizes=True.
BatchStats
- Total prompt tokens processed across all samples.
- Aggregate prompt processing speed in tokens per second.
- Total time in seconds spent processing prompts.
- Total tokens generated across all samples.
- Aggregate generation speed in tokens per second.
- Total time in seconds spent generating tokens.
- Peak memory usage in gigabytes.
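The throughput and time fields are related by a simple ratio: aggregate tokens per second is the token count divided by the seconds spent in that phase. With made-up illustrative numbers:

```python
# Illustrative arithmetic: aggregate tps = total tokens / total seconds in that phase.
prompt_tokens, prompt_time = 4096, 2.0                # tokens, seconds (made-up numbers)
generation_tokens, generation_time = 512, 8.0

prompt_tps = prompt_tokens / prompt_time              # 2048.0 tokens/sec
generation_tps = generation_tokens / generation_time  # 64.0 tokens/sec
```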