TRL distinguishes between two orthogonal dimensions when describing a dataset:
  • Format — how the data is structured: standard (plain text strings) or conversational (lists of role/content messages).
  • Type — what task the data is designed for: language modeling, prompt-only, prompt-completion, preference, unpaired preference, or stepwise supervision.

Formats

Standard

Standard datasets consist of plain text strings. Columns vary by task type.
# Language modeling
{"text": "The sky is blue."}

# Prompt-only
{"prompt": "The sky is"}

# Prompt-completion
{"prompt": "The sky is", "completion": " blue."}

# Preference
{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}

# Unpaired preference
{"prompt": "The sky is", "completion": " blue.", "label": True}

Conversational

Conversational datasets contain sequences of messages. Each message has a role ("user", "assistant", "system") and content.
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
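The same task types exist in conversational form: the string columns simply become lists of messages. Illustrative examples:

```python
# Prompt-only (conversational)
prompt_only = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}

# Prompt-completion (conversational)
prompt_completion = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}

# Preference (conversational)
preference = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}
```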

Dataset types

Language modeling

A language modeling dataset contains a "text" column (standard) or "messages" column (conversational) with the full sequence.
{"text": "The sky is blue."}

Prompt-only

Only the initial prompt is provided. The model is expected to generate a completion.
{"prompt": "The sky is"}
Prompt-only and language modeling types differ in how TRL processes them. For prompt-only, apply_chat_template appends a generation prompt (<|assistant|>\n), signaling the model to continue. For language modeling, the input is treated as a complete sequence and terminated with an end-of-text token.
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# Prompt-only
apply_chat_template({"prompt": [{"role": "user", "content": "What color is the sky?"}]}, tokenizer)
# Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'}

# Language modeling
apply_chat_template({"messages": [{"role": "user", "content": "What color is the sky?"}]}, tokenizer)
# Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'}

Prompt-completion

A prompt and its corresponding completion are both provided.
{"prompt": "The sky is", "completion": " blue."}

Preference

A preference dataset provides a "prompt", a "chosen" response, and a "rejected" response. The model learns to prefer "chosen" over "rejected". Some datasets omit the "prompt" column — in that case the prompt is implicit and embedded in "chosen" and "rejected". Explicit prompts are recommended.
# Explicit prompt (recommended)
{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}

# Implicit prompt
{"chosen": "The sky is blue.", "rejected": "The sky is green."}

Unpaired preference

Similar to a preference dataset, but instead of paired chosen/rejected completions for the same prompt, each example contains a single "completion" and a boolean "label" indicating whether it is preferred.
{"prompt": "The sky is", "completion": " blue.", "label": True}

Stepwise supervision

A stepwise (or process) supervision dataset provides multiple completion steps, each with its own label. This enables targeted feedback at each reasoning step — useful for tasks like mathematical reasoning.
{
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
        "Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8."
    ],
    "labels": [True, False]
}

Which dataset type to use?

The right type depends on which trainer you are using:
Trainer            Expected dataset type
SFTTrainer         Language modeling or prompt-completion
DPOTrainer         Preference (explicit prompt recommended)
RewardTrainer      Preference (implicit prompt recommended)
GRPOTrainer        Prompt-only
RLOOTrainer        Prompt-only
BCOTrainer         Unpaired preference or preference (explicit prompt recommended)
CPOTrainer         Preference (explicit prompt recommended)
GKDTrainer         Prompt-completion
KTOTrainer         Unpaired preference or preference (explicit prompt recommended)
NashMDTrainer      Prompt-only
OnlineDPOTrainer   Prompt-only
ORPOTrainer        Preference (explicit prompt recommended)
PRMTrainer         Stepwise supervision
XPOTrainer         Prompt-only

Tool calling

Some chat templates support tool calling, allowing the model to invoke external functions during generation. In such conversations, the assistant message carries a "tool_calls" field instead of plain "content".
messages = [
    {"role": "user", "content": "Turn on the living room lights."},
    {"role": "assistant", "tool_calls": [
        {"type": "function", "function": {
            "name": "control_light",
            "arguments": {"room": "living room", "state": "on"}
        }}
    ]},
    {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
    {"role": "assistant", "content": "Done!"}
]
When preparing SFT datasets for tool calling, include a tools column containing the JSON schema of available tools. You can generate this schema automatically from Python function signatures:
from transformers.utils import get_json_schema

def control_light(room: str, state: str) -> str:
    """
    Controls the lights in a room.

    Args:
        room: The name of the room.
        state: The desired state of the light ("on" or "off").

    Returns:
        str: A message indicating the new state of the lights.
    """
    return f"The lights in {room} are now {state}."

json_schema = get_json_schema(control_light)
A complete dataset entry:
{"messages": messages, "tools": [json_schema]}
To build a Dataset with tool calling data, use on_mixed_types="use_json" to auto-apply the Json() type for tool arguments:
from datasets import Dataset

dataset = Dataset.from_list(data, on_mixed_types="use_json")
On datasets versions older than 4.7.0 (which lack the Json() type), store tools as a JSON string: json.dumps([json_schema]).
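On those older versions, the fallback looks like this (sketch; the schema dict here is a stand-in for the output of get_json_schema above):

```python
import json

# Stand-in for the schema produced by get_json_schema(control_light) above
json_schema = {"type": "function", "function": {"name": "control_light"}}

messages = [{"role": "user", "content": "Turn on the living room lights."}]

# Store the tools column as a JSON string instead of a nested structure
entry = {"messages": messages, "tools": json.dumps([json_schema])}
```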

Applying chat templates

is_conversational

Use is_conversational to check whether a dataset example is in conversational format. It inspects the keys "prompt", "chosen", "rejected", "completion", and "messages" and returns True if the value is a list of role/content message dicts.
from trl import is_conversational

# Conversational
example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
is_conversational(example)  # True

# Standard
example = {"prompt": "The sky is"}
is_conversational(example)  # False

apply_chat_template

apply_chat_template converts a conversational example into plain strings using a tokenizer’s chat template. It handles all supported dataset types:
  • Language modeling ("messages"): produces a "text" key.
  • Prompt-only ("prompt"): appends a generation prompt if the last role is "user", or continues the final message if it is "assistant".
  • Prompt-completion ("prompt" + "completion"): produces separate "prompt" and "completion" strings.
  • Preference ("prompt" + "chosen" + "rejected"): produces "prompt", "chosen", and "rejected" strings.
  • Unpaired preference ("prompt" + "completion" + "label"): produces string versions of "prompt" and "completion", passing "label" through unchanged.
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}
apply_chat_template(example, tokenizer)
# {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n',
#  'completion': 'It is blue.<|end|>\n'}

maybe_apply_chat_template

maybe_apply_chat_template is a safe wrapper: it calls apply_chat_template only when is_conversational returns True. Use it in preprocessing pipelines where you cannot guarantee all examples are conversational.
from trl import maybe_apply_chat_template

# Conversational example → template is applied
conversational_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}
maybe_apply_chat_template(conversational_example, tokenizer)
# {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n',
#  'completion': 'It is blue.<|end|>\n'}

# Standard example → returned unchanged
standard_example = {"prompt": "The sky is", "completion": " blue."}
maybe_apply_chat_template(standard_example, tokenizer)
# {'prompt': 'The sky is', 'completion': ' blue.'}

Dataset packing with pack_dataset

Packing combines multiple short tokenized sequences into fixed-length chunks to improve training efficiency and reduce padding.
from trl import pack_dataset

packed = pack_dataset(tokenized_dataset, seq_length=2048)
Three strategies are available:

bfd — preserves sequence boundaries. Sequences longer than seq_length are truncated, discarding overflow tokens. Best for SFT and conversational datasets where maintaining conversation structure matters.
packed = pack_dataset(dataset, seq_length=4, strategy="bfd")

bfd_split — like bfd, but splits overflow sequences into chunks instead of truncating them. Prevents token loss for pre-training or long-document datasets, but may break conversation structure in SFT datasets.
packed = pack_dataset(dataset, seq_length=4, strategy="bfd_split")

wrapped — faster but more aggressive. Ignores sequence boundaries entirely and wraps tokens across examples to fill each packed chunk completely.
packed = pack_dataset(dataset, seq_length=4, strategy="wrapped")
The bfd strategy adds a seq_lengths column to the output dataset that records the length of each original sequence within the packed chunk. Downstream code may need to be aware of this extra column.
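The wrapped strategy can be illustrated in a few lines: concatenate all token ids, then slice into fixed-size chunks (toy sketch, not TRL's implementation):

```python
def wrapped_pack(sequences: list[list[int]], seq_length: int) -> list[list[int]]:
    """Pack token-id sequences by concatenating everything and slicing into
    fixed-size chunks. Sequence boundaries are ignored; the last chunk may
    be shorter. Toy sketch of the 'wrapped' strategy, not TRL's code.
    """
    flat = [tok for seq in sequences for tok in seq]
    return [flat[i:i + seq_length] for i in range(0, len(flat), seq_length)]
```

For example, `wrapped_pack([[1, 2, 3], [4, 5], [6]], seq_length=4)` yields `[[1, 2, 3, 4], [5, 6]]`: the second sequence is split across a chunk boundary.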

Converting between dataset types

Many publicly available datasets do not match TRL’s expected format out of the box. TRL provides several utility functions to help:
  • extract_prompt — extract the shared prompt from a preference dataset with implicit prompt.
  • unpair_preference_dataset — convert a paired preference dataset into an unpaired one.
  • maybe_unpair_preference_dataset — unpair only if the dataset is currently paired.
  • maybe_extract_prompt — extract the prompt only if one is not already present.
  • maybe_convert_to_chatml — convert a dataset in the "from"/"value" conversational format to the ChatML "role"/"content" format.
For example, to convert a preference dataset with implicit prompt to an unpaired preference dataset:
from datasets import Dataset
from trl import extract_prompt, unpair_preference_dataset

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"},
         {"role": "assistant", "content": "It is blue."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"},
         {"role": "assistant", "content": "It is green."}],
    ],
})

dataset = dataset.map(extract_prompt)
dataset = unpair_preference_dataset(dataset)

# dataset[0]:
# {'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
#  'completion': [{'role': 'assistant', 'content': 'It is blue.'}],
#  'label': True}
Before calling unpair_preference_dataset, ensure that all "chosen" completions can be labeled as good and all "rejected" completions as bad. The "chosen" and "rejected" completions in a paired dataset are not always absolutely good or bad — they may only be relatively better or worse. Check absolute quality (for example, using a reward model) before converting.
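One way to make labels absolute is to score each completion with a reward model and threshold the score before unpairing. A sketch, where reward_fn is a hypothetical stand-in for a real reward-model call:

```python
def relabel_with_rewards(example: dict, reward_fn, threshold: float = 0.0) -> list[dict]:
    """Convert a paired preference example into unpaired examples whose labels
    reflect absolute quality according to a reward model, rather than the
    relative chosen/rejected preference.

    Sketch only; reward_fn is a hypothetical scoring function.
    """
    return [
        {
            "prompt": example["prompt"],
            "completion": completion,
            "label": reward_fn(example["prompt"], completion) >= threshold,
        }
        for completion in (example["chosen"], example["rejected"])
    ]
```

With this approach both completions of a pair can end up labeled good (or both bad), which a naive unpairing cannot express.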
For additional conversion examples, including from stepwise supervision and between preference variants, see the example scripts in the TRL repository.
