The trl.data_utils module (also importable from the top-level trl package) provides helpers for dataset preprocessing: applying chat templates, detecting conversational formats, extracting prompts from preference data, packing sequences, and converting legacy from/value formats to ChatML.
is_conversational
Checks whether a dataset example uses the conversational message format (role/content dicts).
Signature
Parameters
A single dataset entry. Supported keys inspected: "prompt", "chosen", "rejected", "completion", "messages".
Returns
bool — True if the first value found under a supported key is a list of dicts containing a "role" key.
Example
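The check can be illustrated with a minimal re-implementation (a sketch of the behavior described above, not the library code; the name is_conversational_sketch is illustrative):

```python
def is_conversational_sketch(example: dict) -> bool:
    # Inspect the supported keys in order; the first one present decides.
    supported_keys = ("prompt", "chosen", "rejected", "completion", "messages")
    for key in supported_keys:
        if key in example:
            value = example[key]
            # Conversational: a non-empty list of dicts carrying a "role" key.
            return (
                isinstance(value, list)
                and len(value) > 0
                and isinstance(value[0], dict)
                and "role" in value[0]
            )
    return False

print(is_conversational_sketch({"messages": [{"role": "user", "content": "Hi"}]}))  # True
print(is_conversational_sketch({"prompt": "The sky is"}))  # False
```

In practice you would call trl.is_conversational directly; the sketch only shows what shape of data counts as conversational.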
is_conversational_from_value
Checks whether a dataset example uses the legacy from/value conversational format (e.g., ShareGPT-style). This format is not recommended; prefer the ChatML role/content format.
Signature
Parameters
A single dataset entry. Inspects the "conversations" key.
Returns
bool — True if example["conversations"] is a list of dicts with both "from" and "value" keys.
Example
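A minimal sketch of the check (illustrative re-implementation, not the library code):

```python
def is_conversational_from_value_sketch(example: dict) -> bool:
    # Legacy ShareGPT-style format: "conversations" holds dicts
    # with "from" and "value" keys instead of "role"/"content".
    conversations = example.get("conversations")
    return (
        isinstance(conversations, list)
        and len(conversations) > 0
        and isinstance(conversations[0], dict)
        and "from" in conversations[0]
        and "value" in conversations[0]
    )

print(is_conversational_from_value_sketch(
    {"conversations": [{"from": "human", "value": "Hello"}]}
))  # True
print(is_conversational_from_value_sketch(
    {"messages": [{"role": "user", "content": "Hello"}]}
))  # False
```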
apply_chat_template
Applies the tokenizer’s chat template to a conversational example. Handles all supported dataset types:

| Dataset type | Required keys |
|---|---|
| Language modeling | "messages" |
| Prompt-only | "prompt" |
| Prompt-completion | "prompt", "completion" |
| Preference | "prompt", "chosen", "rejected" |
| Preference (implicit prompt) | "chosen", "rejected" |
| Unpaired preference | "prompt", "completion", "label" |
If the role of the last message is "user" or "tool", a generation prompt is appended; if it is "assistant", the message is continued.
Signature
Parameters
Single dataset entry with conversational messages. May also contain a "chat_template_kwargs" key with additional kwargs forwarded to the template renderer.
Tokenizer or processor whose apply_chat_template method will be called.
List of tool definitions forwarded to the chat template. Has no effect if the template does not support function calling.
**template_kwargs
Additional keyword arguments passed directly to tokenizer.apply_chat_template.
Returns
dict[str, str] — Dictionary with the same keys as the input (except "messages" becomes "text"), all values converted to formatted strings.
Example
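A toy illustration of the key handling (using a stand-in template function instead of a real tokenizer, so the example stays self-contained; toy_render and apply_chat_template_sketch are hypothetical names, not trl API):

```python
def toy_render(messages):
    # Stand-in for tokenizer.apply_chat_template: one tagged line per message.
    return "".join(f"<|{m['role']}|>{m['content']}\n" for m in messages)

def apply_chat_template_sketch(example):
    # Language modeling: "messages" becomes "text".
    if "messages" in example:
        return {"text": toy_render(example["messages"])}
    # Prompt-completion: render both, then split the completion string
    # off as the part that follows the rendered prompt.
    if "prompt" in example and "completion" in example:
        prompt_text = toy_render(example["prompt"])
        full_text = toy_render(example["prompt"] + example["completion"])
        return {"prompt": prompt_text, "completion": full_text[len(prompt_text):]}
    raise ValueError("dataset type not covered by this sketch")

out = apply_chat_template_sketch(
    {"messages": [{"role": "user", "content": "What color is the sky?"}]}
)
print(out)  # {'text': '<|user|>What color is the sky?\n'}
```

With a real tokenizer, the formatting comes from its chat template rather than this toy renderer.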
maybe_apply_chat_template
Applies apply_chat_template only when the example is in a conversational format (detected via is_conversational). Non-conversational examples are returned unchanged.
Signature
Parameters
Single dataset entry, conversational or plain-text.
Tokenizer used to apply the chat template.
Tool definitions forwarded to the template.
**template_kwargs
Additional kwargs passed to tokenizer.apply_chat_template.
Returns
dict[str, str] — The formatted example if conversational; the original example otherwise.
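The conditional behavior can be sketched as follows (render stands in for the tokenizer's chat-template call; maybe_apply_sketch is an illustrative name, not trl API):

```python
def maybe_apply_sketch(example, render):
    # Apply the template only for conversational examples; plain text passes through.
    msgs = example.get("messages")
    if isinstance(msgs, list) and msgs and isinstance(msgs[0], dict) and "role" in msgs[0]:
        return {"text": render(msgs)}
    return example  # non-conversational: returned unchanged

plain = {"text": "The sky is blue."}
print(maybe_apply_sketch(plain, lambda m: "") == plain)  # True: unchanged

conv = {"messages": [{"role": "user", "content": "Hi"}]}
print(maybe_apply_sketch(conv, lambda m: m[0]["content"]))  # {'text': 'Hi'}
```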
extract_prompt
Extracts the shared prompt from a paired preference example where the prompt is implicit (i.e., both "chosen" and "rejected" include it as a prefix).
The function finds the longest common prefix of conversation turns between "chosen" and "rejected", removes it from both, and returns it as "prompt".
Signature
Parameters
Dictionary containing "chosen" and "rejected" keys, each holding a list of conversation turns (either message dicts or plain strings).
Returns
dict with keys:
- "prompt": the extracted common prefix.
- "chosen": remainder of the chosen completion.
- "rejected": remainder of the rejected completion.
Example
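The longest-common-prefix logic can be sketched in a few lines (illustrative re-implementation, not the library code):

```python
def extract_prompt_sketch(example):
    chosen, rejected = example["chosen"], example["rejected"]
    # Walk forward while the turns agree: that shared prefix is the prompt.
    i = 0
    while i < min(len(chosen), len(rejected)) and chosen[i] == rejected[i]:
        i += 1
    return {"prompt": chosen[:i], "chosen": chosen[i:], "rejected": rejected[i:]}

example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}
result = extract_prompt_sketch(example)
print(result["prompt"])   # [{'role': 'user', 'content': 'What color is the sky?'}]
print(result["chosen"])   # [{'role': 'assistant', 'content': 'It is blue.'}]
```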
maybe_extract_prompt
Extracts the shared prompt from a preference example only when necessary. If a "prompt" key already exists and is in the same format (both conversational or both plain-text), the example is returned as-is.
Signature
Parameters
Single dataset entry. Must contain "chosen" and "rejected" to be treated as a preference example.
Returns
The original example (if prompt extraction is not needed) or the result of extract_prompt.
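A sketch of the conditional wrapper (the format check between prompt and completions is omitted for brevity; maybe_extract_prompt_sketch is an illustrative name):

```python
def maybe_extract_prompt_sketch(example):
    # An explicit "prompt" key (or a non-preference example) means nothing to do.
    if "prompt" in example or not ("chosen" in example and "rejected" in example):
        return example
    # Otherwise extract the longest common prefix of turns as the prompt.
    chosen, rejected = example["chosen"], example["rejected"]
    i = 0
    while i < min(len(chosen), len(rejected)) and chosen[i] == rejected[i]:
        i += 1
    return {"prompt": chosen[:i], "chosen": chosen[i:], "rejected": rejected[i:]}

explicit = {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}
print(maybe_extract_prompt_sketch(explicit) is explicit)  # True: returned as-is

implicit = {"chosen": ["The sky is", " blue."], "rejected": ["The sky is", " green."]}
print(maybe_extract_prompt_sketch(implicit)["prompt"])  # ['The sky is']
```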
pack_dataset
Packs short sequences in a dataset into longer chunks of exactly seq_length tokens, reducing padding and increasing training efficiency.
Three strategies are available:
| Strategy | Sequence boundaries | Token loss | Best for |
|---|---|---|---|
| "bfd" | Preserved | Truncates overflow | SFT, chat datasets |
| "bfd_split" | Preserved | None (splits overflow) | Pre-training, long documents |
| "wrapped" | Ignored (cuts mid-sequence) | None | Pre-training (fastest) |
The "bfd" and "bfd_split" strategies add a "seq_lengths" column that records the original sequence boundaries within each packed example — useful for constructing position_ids.
Signature
Parameters
Dataset to pack. All columns must be lists of token-id lists.
Target packed sequence length in tokens.
Packing strategy. One of "bfd", "bfd_split", or "wrapped".
Extra keyword arguments forwarded to dataset.map during packing (e.g., num_proc, batch_size).
Returns
Dataset | DatasetDict — The packed dataset. The number of rows typically decreases as sequences are merged.
Example
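The "wrapped" strategy can be sketched in pure Python (concatenate everything, then cut fixed-size chunks); the real function operates on a datasets.Dataset and also implements the two "bfd" strategies:

```python
def pack_wrapped_sketch(examples, seq_length):
    # Concatenate all token-id lists, ignoring sequence boundaries,
    # then slice the flat stream into chunks of seq_length tokens.
    flat = [tok for ids in examples["input_ids"] for tok in ids]
    chunks = [flat[i:i + seq_length] for i in range(0, len(flat), seq_length)]
    return {"input_ids": chunks}

packed = pack_wrapped_sketch({"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}, seq_length=4)
print(packed["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Note how three input rows become three packed rows here only because of the final short remainder; on realistic data the row count drops as sequences are merged.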
unpair_preference_dataset
Converts a paired preference dataset ("chosen" / "rejected" columns) into an unpaired format where each pair becomes two separate rows, each with a "completion" and a boolean "label" column (True for chosen, False for rejected).
Signature
Parameters
Paired preference dataset with "chosen", "rejected", and optionally "prompt" columns.
Number of processes for parallel dataset processing.
Description shown alongside the progress bar during mapping.
Returns
Dataset | DatasetDict — Unpaired dataset with columns "prompt" (if present), "completion", and "label".
Example
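The row transformation can be sketched on plain dicts (the real function works on a datasets.Dataset; unpair_sketch is an illustrative name):

```python
def unpair_sketch(rows):
    # Each paired row becomes two rows: chosen (label=True), rejected (label=False).
    unpaired = []
    for row in rows:
        unpaired.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        unpaired.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return unpaired

rows = [{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}]
for r in unpair_sketch(rows):
    print(r)
# {'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
# {'prompt': 'The sky is', 'completion': ' green.', 'label': False}
```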
maybe_unpair_preference_dataset
Calls unpair_preference_dataset only when the dataset contains both "chosen" and "rejected" columns. Otherwise returns the dataset unchanged.
Signature
Parameters
Preference dataset to conditionally unpair.
Number of processes for parallel processing.
Progress bar description.
Returns
Dataset | DatasetDict — Unpaired dataset if the input was paired, otherwise the original dataset.
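The column check that gates the conversion can be sketched on a list of dicts (illustrative only; the real function inspects Dataset.column_names):

```python
def maybe_unpair_sketch(rows):
    # Only unpair when both "chosen" and "rejected" columns are present.
    if not rows or not {"chosen", "rejected"} <= rows[0].keys():
        return rows  # already unpaired: returned unchanged
    unpaired = []
    for row in rows:
        unpaired.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        unpaired.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return unpaired

already = [{"prompt": "Hi", "completion": " there", "label": True}]
print(maybe_unpair_sketch(already) == already)  # True: nothing to do
```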
maybe_convert_to_chatml
Converts a dataset example from the legacy from/value format (e.g., ShareGPT) to ChatML role/content format. If the example is already in ChatML format, the function is a no-op.
The transformation performed:
- Renames "from" → "role" in each message dict.
- Renames "value" → "content" in each message dict.
- Renames the top-level key "conversations" → "messages".
Signature
Parameters
Single dataset entry. Inspects keys "prompt", "completion", "chosen", "rejected", "messages", and "conversations".
Returns
dict[str, list] — Example in ChatML format.
Example
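The three renamings can be sketched for the common "conversations" case (illustrative re-implementation covering only that key):

```python
def maybe_convert_to_chatml_sketch(example):
    conversations = example.get("conversations")
    legacy = (
        isinstance(conversations, list)
        and len(conversations) > 0
        and isinstance(conversations[0], dict)
        and "from" in conversations[0]
    )
    if not legacy:
        return example  # already ChatML (or not conversational): no-op
    # Rename "from" -> "role", "value" -> "content", "conversations" -> "messages".
    messages = [{"role": m["from"], "content": m["value"]} for m in conversations]
    return {"messages": messages}

legacy = {"conversations": [{"from": "user", "value": "What color is the sky?"}]}
print(maybe_convert_to_chatml_sketch(legacy))
# {'messages': [{'role': 'user', 'content': 'What color is the sky?'}]}
```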
prepare_multimodal_messages
Converts a list of messages into a structured multimodal format and injects provided image objects into "image" placeholder blocks.
When the input messages have plain string "content" fields, the function:
- Wraps each string in {"type": "text", "text": ...}.
- Inserts {"type": "image"} placeholder blocks before the text in the first user message.
- Fills each {"type": "image"} placeholder with the corresponding image object from the images list.
When the "content" fields are already structured lists, only step 3 is applied.
Signature
Parameters
List of message dicts with "role" and "content" (or "tool_calls" for assistant turns). Roles "system", "user", "assistant", and "tool" are supported.
Image objects (e.g., PIL images) to inject. May be empty if no images are referenced in the messages.
Returns
list[dict[str, Any]] — Deep copy of messages with all "content" fields converted to structured block lists and image placeholders populated.
Example
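The three steps can be sketched as follows (a simplified re-implementation using a placeholder string instead of a real PIL image; prepare_multimodal_sketch is an illustrative name):

```python
import copy

def prepare_multimodal_sketch(messages, images):
    messages = copy.deepcopy(messages)  # the input is never mutated
    image_iter = iter(images)
    first_user_handled = False
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            # Step 1: wrap plain strings in a text block.
            blocks = [{"type": "text", "text": content}]
            # Step 2: image placeholders go before the text of the first user message.
            if message["role"] == "user" and not first_user_handled:
                blocks = [{"type": "image"} for _ in images] + blocks
                first_user_handled = True
            message["content"] = blocks
        # Step 3: fill each unfilled {"type": "image"} placeholder.
        for block in message.get("content") or []:
            if isinstance(block, dict) and block.get("type") == "image" and "image" not in block:
                block["image"] = next(image_iter)
    return messages

messages = [{"role": "user", "content": "Describe the image."}]
out = prepare_multimodal_sketch(messages, images=["<PIL image>"])
print(out[0]["content"])
# [{'type': 'image', 'image': '<PIL image>'}, {'type': 'text', 'text': 'Describe the image.'}]
```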
prepare_multimodal_messages_vllm
Converts structured multimodal messages from the standard TRL format into a format compatible with vLLM. Specifically, replaces "type": "image" content blocks (which use an "image" key) with "type": "image_pil" blocks (which use an "image_pil" key) as required by vLLM’s input format.
Use this function when passing pre-processed multimodal messages to a vLLM generation server.
Signature
Parameters
Messages with "role" and "content" keys. Content is expected to be a list of structured blocks (as produced by prepare_multimodal_messages).
Returns
list[dict[str, Any]] — Deep copy of messages with all {"type": "image", "image": ...} blocks replaced by {"type": "image_pil", "image_pil": ...} for vLLM compatibility.
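The block substitution can be sketched as follows (a simplified re-implementation with a placeholder string standing in for a PIL image; prepare_for_vllm_sketch is an illustrative name):

```python
import copy

def prepare_for_vllm_sketch(messages):
    messages = copy.deepcopy(messages)  # the input is never mutated
    for message in messages:
        for i, block in enumerate(message.get("content") or []):
            if isinstance(block, dict) and block.get("type") == "image":
                # vLLM expects "image_pil" blocks instead of "image" blocks.
                message["content"][i] = {"type": "image_pil", "image_pil": block.get("image")}
    return messages

msgs = [{"role": "user", "content": [
    {"type": "image", "image": "<PIL image>"},
    {"type": "text", "text": "Describe it."},
]}]
print(prepare_for_vllm_sketch(msgs)[0]["content"][0])
# {'type': 'image_pil', 'image_pil': '<PIL image>'}
```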