Overview

Supervised Fine-Tuning (SFT) is the simplest and most commonly used method to adapt a language model to a target dataset. The model is trained in a fully supervised fashion on pairs of input and output sequences. The goal is to minimize the negative log-likelihood (NLL) of the target sequence, conditioned on the input. SFTTrainer supports both language modeling and prompt-completion datasets, and works with standard or conversational dataset formats. When given a conversational dataset, the trainer automatically applies the model’s chat template.
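Concretely, given an input x and target tokens y = (y_1, …, y_T), the SFT objective is the standard autoregressive NLL:

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```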

Quick start

from trl import SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
To launch distributed training, save the script (for example as train_sft.py) and run:
accelerate launch train_sft.py

Dataset format

SFTTrainer accepts four dataset formats:
# Standard language modeling
{"text": "The sky is blue."}

# Conversational language modeling
{"messages": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is blue."}]}

# Standard prompt-completion
{"prompt": "The sky is",
 "completion": " blue."}

# Conversational prompt-completion
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "completion": [{"role": "assistant", "content": "It is blue."}]}
For prompt-completion datasets, loss is computed only on the completion tokens by default. For language modeling datasets, loss is computed on the full sequence.
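Mechanically, completion-only loss is typically implemented by masking the prompt positions in the labels (the ignore index -100 in PyTorch cross-entropy). A simplified sketch, not TRL's actual collator:

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips positions with this label

def build_labels(prompt_ids, completion_ids, completion_only_loss=True):
    """Simplified sketch of completion-only loss masking."""
    input_ids = prompt_ids + completion_ids
    if completion_only_loss:
        # Prompt positions contribute nothing to the loss.
        labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    else:
        # Language modeling: loss over the full sequence.
        labels = list(input_ids)
    return input_ids, labels

input_ids, labels = build_labels([101, 102], [201, 202, 203])
# labels -> [-100, -100, 201, 202, 203]
```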
If your dataset uses different column names, preprocess it to match the expected format:
from datasets import load_dataset

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["Question"]}],
        "completion": [
            {"role": "assistant", "content": f"<think>{example['Complex_CoT']}</think>{example['Response']}"}
        ],
    }

dataset = dataset.map(preprocess_function, remove_columns=["Question", "Response", "Complex_CoT"])
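As a quick sanity check, applying the same mapping to a toy row (illustrative values, not taken from the real dataset) yields the conversational prompt-completion shape:

```python
# Toy row with the original column names (illustrative values).
example = {
    "Question": "What color is the sky?",
    "Complex_CoT": "Rayleigh scattering favors short wavelengths.",
    "Response": "It is blue.",
}

# Same mapping as preprocess_function above.
def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["Question"]}],
        "completion": [
            {"role": "assistant", "content": f"<think>{example['Complex_CoT']}</think>{example['Response']}"}
        ],
    }

row = preprocess_function(example)
print(row["prompt"][0]["content"])
# -> What color is the sky?
print(row["completion"][0]["content"])
# -> <think>Rayleigh scattering favors short wavelengths.</think>It is blue.
```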

Key configuration parameters

max_length
int | None
default:"1024"
Maximum length of the tokenized sequence. Sequences longer than this are truncated. Set to None to disable truncation (recommended for VLMs).
packing
bool
default:"false"
Whether to pack multiple short sequences into fixed-length blocks, improving GPU utilization and reducing padding waste. Uses max_length to define block size.
packing_strategy
str
default:"bfd"
Strategy for packing sequences: "bfd" (best-fit decreasing, truncates overflow), "bfd_split" (best-fit decreasing, splits overflow sequences), or "wrapped" (aggressive, cuts mid-sequence).
dataset_text_field
str
default:"text"
Name of the column containing text data for language modeling datasets.
truncation_mode
str
default:"keep_start"
Which end to truncate when a sequence exceeds max_length. Options: "keep_start" or "keep_end".
completion_only_loss
bool | None
default:"None"
Whether to compute loss only on the completion part. When None, defaults to True for prompt-completion datasets and False for language modeling datasets.
assistant_only_loss
bool
default:"false"
Whether to compute loss only on assistant responses in conversational datasets. Requires a chat template that supports the {% generation %} and {% endgeneration %} keywords.
loss_type
str
default:"nll"
Type of loss to use. Options: "nll" (standard negative log-likelihood) or "dft" (Dynamic Fine-Tuning, which rectifies the reward signal to improve generalization).
model_init_kwargs
dict | None
Keyword arguments forwarded to AutoModelForCausalLM.from_pretrained when the model argument is a string. Useful for setting dtype, device_map, or output_router_logits for MoE models.
chat_template_path
str | None
Path to a tokenizer or a Jinja template file to set as the model’s chat template. Useful when fine-tuning base models that do not have a chat template.
eos_token
str | None
Token used to indicate end of sequence. Required when the chat template uses a different EOS token than the tokenizer’s default.
padding_free
bool
default:"false"
Perform forward passes without padding by flattening all sequences into a single continuous sequence. Requires FlashAttention 2 or 3. Automatically enabled when packing=True with packing_strategy="bfd".
activation_offloading
bool
default:"false"
Offload activations to CPU to reduce GPU memory usage.
SFTConfig also overrides some TrainingArguments defaults: logging_steps=10, gradient_checkpointing=True, bf16=True, and learning_rate=2e-5.
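These defaults can themselves be overridden when constructing the config, like any other TrainingArguments field. A brief sketch (the output_dir name and values shown are illustrative):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="my-sft-run",
    learning_rate=1e-5,   # override the 2e-5 default
    logging_steps=50,     # override the 10 default
    bf16=False,           # e.g. on hardware without bf16 support
)
```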

Instruction tuning

To turn a base model into an instruction-following model, provide a chat template and a conversational dataset:
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B-Base",
    args=SFTConfig(
        output_dir="Qwen3-0.6B-Instruct",
        chat_template_path="HuggingFaceTB/SmolLM3-3B",
    ),
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
Some base models (such as Qwen models) already have a chat template in the tokenizer. In that case, you do not need to set chat_template_path, but you should align the EOS token. For example, for Qwen/Qwen2.5-1.5B, set eos_token="<|im_end|>" in SFTConfig.
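For example, for Qwen/Qwen2.5-1.5B the call looks like this (the output_dir name is illustrative; the eos_token value is the one stated above):

```python
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B",
    args=SFTConfig(
        output_dir="Qwen2.5-1.5B-Instruct",
        eos_token="<|im_end|>",  # align EOS with the tokenizer's chat template
    ),
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```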

Dataset packing

Packing is a technique to increase training efficiency by grouping multiple short examples into a single fixed-length block, reducing wasted padding tokens.
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=SFTConfig(
        packing=True,
        max_length=2048,
    ),
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
For best performance with packing, use FlashAttention 2 or 3. The padding_free option is automatically enabled with packing_strategy="bfd", eliminating padding overhead entirely.
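To build intuition for the "bfd" strategy, here is a rough, library-agnostic sketch of best-fit-decreasing bin packing over sequence lengths (not TRL's actual implementation, which additionally truncates sequences longer than the block size first):

```python
def pack_bfd(lengths, block_size):
    """Greedy best-fit decreasing: place each sequence (longest first)
    into the fullest block that still has room, else open a new block."""
    bins = []        # remaining capacity per block
    assignment = {}  # sequence index -> block index
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        # Best fit: among blocks that still fit, pick the least remaining capacity.
        best = min(
            (b for b in range(len(bins)) if bins[b] >= lengths[idx]),
            key=lambda b: bins[b],
            default=None,
        )
        if best is None:
            bins.append(block_size)
            best = len(bins) - 1
        bins[best] -= lengths[idx]
        assignment[idx] = best
    return assignment, bins

assignment, remaining = pack_bfd([7, 2, 5, 3], block_size=8)
# Three blocks: [7], [5, 3], [2], with remaining capacities [1, 0, 6].
```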

Training with PEFT/LoRA

Use the PEFT library to train only a small set of adapter parameters instead of the full model:
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    "Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    peft_config=LoraConfig(),
)
trainer.train()
To continue training an existing PEFT model:
from datasets import load_dataset
from trl import SFTTrainer
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("trl-lib/Qwen3-4B-LoRA", is_trainable=True)
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
)
trainer.train()
When training adapters, use a higher learning rate (around 1e-4) since only the new adapter parameters are being learned.
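Putting the two together, a LoRA run with the suggested higher learning rate might look like this (the LoraConfig hyperparameters shown are illustrative, not prescribed by the text):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

trainer = SFTTrainer(
    "Qwen/Qwen3-0.6B",
    args=SFTConfig(learning_rate=1e-4),  # higher LR for adapter-only training
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
```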

Training Vision-Language Models

SFTTrainer supports VLMs. Provide a dataset with an image column (single image) or images column (list of images):
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()
For VLMs, set max_length=None to prevent truncation from removing image tokens, which causes errors during training.

Logged metrics

Metric                 Description
loss                   Average cross-entropy loss over non-masked tokens
entropy                Average entropy of the model’s predicted token distribution
mean_token_accuracy    Proportion of tokens whose top-1 prediction matches the ground truth
learning_rate          Current learning rate
grad_norm              L2 norm of the gradients before clipping
num_tokens             Total tokens processed so far
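For instance, mean_token_accuracy can be sketched as follows (a simplified, framework-free version that compares per-position argmax predictions against labels, skipping masked positions):

```python
def mean_token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of non-masked positions where the top-1 prediction
    matches the ground-truth token."""
    pairs = [(p, l) for p, l in zip(predictions, labels) if l != ignore_index]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)

mean_token_accuracy([5, 9, 7, 3], [-100, 9, 7, 4])  # 2 of 3 unmasked correct
```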