- Format — how the data is structured: standard (plain text strings) or conversational (lists of role/content messages).
- Type — what task the data is designed for: language modeling, prompt-only, prompt-completion, preference, unpaired preference, or stepwise supervision.
Formats
Standard
Standard datasets consist of plain text strings. Columns vary by task type.
Conversational
Conversational datasets contain sequences of messages. Each message has a role ("user", "assistant", or "system") and content.
Dataset types
Language modeling
A language modeling dataset contains a "text" column (standard) or a "messages" column (conversational) with the full sequence.
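For instance, a language modeling example in each format might look like (illustrative values):

```python
# Standard: a single "text" column holding the full sequence
lm_standard = {"text": "The sky is blue."}

# Conversational: a "messages" column holding the full exchange
lm_conversational = {
    "messages": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ]
}
```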
Prompt-only
Only the initial prompt is provided. The model is expected to generate a completion.
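A prompt-only example in each format might look like (illustrative values):

```python
# Standard: only the prompt; the completion is generated during training
prompt_only_standard = {"prompt": "The sky is"}

# Conversational: the conversation ends with the user turn
prompt_only_conversational = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}]
}
```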
Prompt-completion
A prompt and its corresponding completion are both provided.
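A prompt-completion example in each format might look like (illustrative values):

```python
# Standard: separate string columns for prompt and completion
prompt_completion_standard = {"prompt": "The sky is", "completion": " blue."}

# Conversational: each column holds a list of messages
prompt_completion_conversational = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}
```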
Preference
A preference dataset provides a "prompt", a "chosen" response, and a "rejected" response. The model learns to prefer "chosen" over "rejected".
Some datasets omit the "prompt" column — in that case the prompt is implicit and embedded in "chosen" and "rejected". Explicit prompts are recommended.
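Preference examples with explicit and implicit prompts might look like (illustrative values):

```python
# Explicit prompt (recommended): the shared prompt has its own column
preference_explicit = {
    "prompt": "The sky is",
    "chosen": " blue.",
    "rejected": " green.",
}

# Implicit prompt: the prompt is embedded in both responses
preference_implicit = {
    "chosen": "The sky is blue.",
    "rejected": "The sky is green.",
}
```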
Unpaired preference
Similar to a preference dataset, but instead of paired chosen/rejected completions for the same prompt, each example contains a single "completion" and a boolean "label" indicating whether it is preferred.
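An unpaired preference example in each format might look like (illustrative values):

```python
# Standard: one completion per row, with a boolean preference label
unpaired_standard = {"prompt": "The sky is", "completion": " blue.", "label": True}

# Conversational: same idea, with message lists instead of strings
unpaired_conversational = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is green."}],
    "label": False,
}
```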
Stepwise supervision
A stepwise (or process) supervision dataset provides multiple completion steps, each with its own label. This enables targeted feedback at each reasoning step, which is useful for tasks like mathematical reasoning.
Which dataset type to use?
The right type depends on which trainer you are using:
| Trainer | Expected dataset type |
|---|---|
| SFTTrainer | Language modeling or prompt-completion |
| DPOTrainer | Preference (explicit prompt recommended) |
| RewardTrainer | Preference (implicit prompt recommended) |
| GRPOTrainer | Prompt-only |
| RLOOTrainer | Prompt-only |
| BCOTrainer | Unpaired preference or preference (explicit prompt recommended) |
| CPOTrainer | Preference (explicit prompt recommended) |
| GKDTrainer | Prompt-completion |
| KTOTrainer | Unpaired preference or preference (explicit prompt recommended) |
| NashMDTrainer | Prompt-only |
| OnlineDPOTrainer | Prompt-only |
| ORPOTrainer | Preference (explicit prompt recommended) |
| PRMTrainer | Stepwise supervision |
| XPOTrainer | Prompt-only |
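For PRMTrainer, a stepwise supervision row carries one label per completion step; an illustrative example:

```python
# One boolean label per reasoning step (illustrative values)
stepwise_example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "The fractional part of 9.8 is 0.8, while that of 9.11 is 0.11.",
        "Since 0.8 > 0.11, 9.8 is larger.",
    ],
    "labels": [True, True],
}
```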
Tool calling
Some chat templates support tool calling, allowing the model to invoke external functions during generation. The model outputs a "tool_calls" field instead of a standard "content" message.
Datasets with tool-calling data also need a tools column containing the JSON schema of available tools. You can generate this schema automatically from Python function signatures.
When building a Dataset with tool-calling data, use on_mixed_types="use_json" to auto-apply the Json() type for tool arguments.
On datasets versions older than 4.7.0 (which lack the Json() type), store tools as a JSON string: json.dumps([json_schema]).
Applying chat templates
is_conversational
Use is_conversational to check whether a dataset example is in conversational format. It inspects the keys "prompt", "chosen", "rejected", "completion", and "messages" and returns True if the value is a list of role/content message dicts.
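In spirit, the check works like this (a minimal sketch, not TRL's exact implementation):

```python
def is_conversational_sketch(example: dict) -> bool:
    # Inspect the columns that may hold messages; a non-empty list of dicts
    # with "role" and "content" keys marks the example as conversational.
    for key in ("prompt", "chosen", "rejected", "completion", "messages"):
        value = example.get(key)
        if (
            isinstance(value, list)
            and value
            and isinstance(value[0], dict)
            and {"role", "content"} <= value[0].keys()
        ):
            return True
    return False
```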
apply_chat_template
apply_chat_template converts a conversational example into plain strings using a tokenizer’s chat template. It handles all supported dataset types:
- Language modeling ("messages"): produces a "text" key.
- Prompt-only ("prompt"): appends a generation prompt if the last role is "user", or continues the final message if it is "assistant".
- Prompt-completion ("prompt" + "completion"): produces separate "prompt" and "completion" strings.
- Preference ("prompt" + "chosen" + "rejected"): produces "prompt", "chosen", and "rejected" strings.
- Unpaired preference ("prompt" + "completion" + "label"): produces string versions of "prompt" and "completion", passing "label" through unchanged.
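As a rough illustration of the prompt-completion case, here is a sketch using a hand-rolled ChatML-style renderer standing in for the tokenizer's chat template (the real function delegates to the tokenizer):

```python
def render_chatml(messages, add_generation_prompt=False):
    # Minimal ChatML-style renderer, standing in for a tokenizer chat template
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

def apply_chat_template_sketch(example):
    # Prompt-completion case: render the prompt with a generation prompt
    # appended, then take the completion as the text that follows it
    prompt = render_chatml(example["prompt"], add_generation_prompt=True)
    full = render_chatml(example["prompt"] + example["completion"])
    return {"prompt": prompt, "completion": full[len(prompt):]}
```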
maybe_apply_chat_template
maybe_apply_chat_template is a safe wrapper: it calls apply_chat_template only when is_conversational returns True. Use it in preprocessing pipelines where you cannot guarantee all examples are conversational.
Dataset packing with pack_dataset
Packing combines multiple short tokenized sequences into fixed-length chunks to improve training efficiency and reduce padding.
bfd (Best Fit Decreasing) — default
Preserves sequence boundaries. Sequences longer than seq_length are truncated, discarding overflow tokens. Best for SFT and conversational datasets where maintaining conversation structure matters.
bfd_split
Like bfd, but splits overflow sequences into chunks instead of truncating them. Prevents token loss for pre-training or long-document datasets, but may break conversation structure in SFT datasets.
wrapped
Faster but more aggressive. Ignores sequence boundaries entirely and wraps tokens across examples to fill each packed chunk completely.
Faster but more aggressive. Ignores sequence boundaries entirely and wraps tokens across examples to fill each packed chunk completely.
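A minimal sketch of the wrapped strategy (not TRL's implementation), operating directly on lists of token ids:

```python
def pack_wrapped(tokenized_sequences, seq_length):
    # Concatenate every sequence, ignoring example boundaries, then slice
    # into fixed-length chunks; a trailing partial chunk is dropped here
    flat = [token for seq in tokenized_sequences for token in seq]
    n_full = len(flat) // seq_length
    return [flat[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]
```

For example, packing `[[1, 2, 3], [4, 5], [6, 7, 8, 9]]` with `seq_length=4` yields `[[1, 2, 3, 4], [5, 6, 7, 8]]`, with the leftover token wrapped out of the final partial chunk.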
Converting between dataset types
Many publicly available datasets do not match TRL’s expected format out of the box. TRL provides several utility functions to help:
- extract_prompt — extract the shared prompt from a preference dataset with implicit prompt.
- unpair_preference_dataset — convert a paired preference dataset into an unpaired one.
- maybe_unpair_preference_dataset — unpair only if the dataset is currently paired.
- maybe_extract_prompt — extract the prompt only if one is not already present.
- maybe_convert_to_chatml — convert from/value conversational format to ChatML role/content format.
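For example, unpairing turns one chosen/rejected pair into two labeled rows; a per-example sketch of what unpair_preference_dataset does (not TRL's exact implementation):

```python
def unpair_example(example):
    # One paired preference row -> two unpaired rows sharing the same prompt
    return [
        {"prompt": example["prompt"], "completion": example["chosen"], "label": True},
        {"prompt": example["prompt"], "completion": example["rejected"], "label": False},
    ]
```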