Overview
Supervised Fine-Tuning (SFT) is the simplest and most commonly used method to adapt a language model to a target dataset. The model is trained in a fully supervised fashion on pairs of input and output sequences, with the goal of minimizing the negative log-likelihood (NLL) of the target sequence conditioned on the input.

SFTTrainer supports both language modeling and prompt-completion datasets, and works with standard or conversational dataset formats. When given a conversational dataset, the trainer automatically applies the model’s chat template.
Quick start
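A minimal run might look like the following sketch; the model and dataset names are example values, substitute your own:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Example dataset and model; any compatible pair works.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # a model name or path, loaded via from_pretrained
    train_dataset=dataset,
    args=SFTConfig(output_dir="Qwen2.5-0.5B-SFT"),
)
trainer.train()
```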
Dataset format
SFTTrainer accepts four dataset formats:
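Concretely, the four formats look like this (toy examples):

```python
# 1. Standard language modeling: a single text field.
standard_lm = {"text": "The sky is blue."}

# 2. Conversational language modeling: a list of chat messages.
conversational_lm = {
    "messages": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ]
}

# 3. Standard prompt-completion: separate prompt and completion strings.
standard_pc = {"prompt": "The sky is", "completion": " blue."}

# 4. Conversational prompt-completion: prompt and completion as message lists.
conversational_pc = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}
```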
For prompt-completion datasets, loss is computed only on the completion tokens by default. For language modeling datasets, loss is computed on the full sequence.
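Completion-only loss can be pictured in pure Python: prompt positions receive the label `-100` (the ignore index used by PyTorch’s cross-entropy), so only completion tokens contribute to the loss. A simplified sketch with made-up token ids, not the trainer’s actual implementation:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking out the prompt part."""
    labels = list(input_ids)
    for i in range(prompt_len):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: 3 prompt tokens followed by 2 completion tokens.
ids = [101, 102, 103, 201, 202]
print(mask_prompt_labels(ids, prompt_len=3))  # [-100, -100, -100, 201, 202]
```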
Key configuration parameters
Data preprocessing
- `max_length`: Maximum length of the tokenized sequence. Sequences longer than this are truncated. Set to `None` to disable truncation (recommended for VLMs).
- `packing`: Whether to pack multiple short sequences into fixed-length blocks, improving GPU utilization and reducing padding waste. Uses `max_length` to define the block size.
- `packing_strategy`: Strategy for packing sequences: `"bfd"` (best-fit decreasing, truncates overflow), `"bfd_split"` (best-fit decreasing, splits overflow sequences), or `"wrapped"` (aggressive, cuts mid-sequence).
- `dataset_text_field`: Name of the column containing text data for language modeling datasets.
- Truncation side: which part of the sequence to keep when it exceeds `max_length`. Options: `"keep_start"` or `"keep_end"`.
Loss computation
- `completion_only_loss`: Whether to compute loss only on the completion part. When `None`, defaults to `True` for prompt-completion datasets and `False` for language modeling datasets.
- `assistant_only_loss`: Whether to compute loss only on assistant responses in conversational datasets. Requires a chat template that supports the `{% generation %}` and `{% endgeneration %}` keywords.
- `loss_type`: Type of loss to use: `"nll"` (standard negative log-likelihood) or `"dft"` (Dynamic Fine-Tuning, which rectifies the reward signal to improve generalization).
Model initialization
- `model_init_kwargs`: Keyword arguments forwarded to `AutoModelForCausalLM.from_pretrained` when the `model` argument is a string. Useful for setting `dtype`, `device_map`, or `output_router_logits` for MoE models.
- `chat_template_path`: Path to a tokenizer or a Jinja template file to set as the model’s chat template. Useful when fine-tuning base models that do not have a chat template.
- `eos_token`: Token used to indicate the end of a sequence. Required when the chat template uses a different EOS token than the tokenizer’s default.
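As an example, forwarding loading options through `model_init_kwargs` might look like this sketch (the output directory is a placeholder):

```python
import torch
from trl import SFTConfig

# Loading options are forwarded to from_pretrained when `model` is a string.
args = SFTConfig(
    output_dir="my-sft-model",  # placeholder path
    model_init_kwargs={
        "dtype": torch.bfloat16,
        "device_map": "auto",
    },
)
```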
Memory optimization
- `padding_free`: Perform forward passes without padding by flattening all sequences into a single continuous sequence. Requires FlashAttention 2 or 3. Automatically enabled when `packing="bfd"`.
- `activation_offloading`: Offload activations to the CPU to reduce GPU memory usage.

SFTConfig also overrides some TrainingArguments defaults: `logging_steps=10`, `gradient_checkpointing=True`, `bf16=True`, and `learning_rate=2e-5`.

Instruction tuning
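A sketch of an instruction-tuning run on a base model, using a conversational dataset and borrowing a chat template from another tokenizer; the model, dataset, and template source below are example values:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A conversational dataset (a "messages" column).
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # a base model without a chat template
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="Qwen2.5-0.5B-Instruct",
        chat_template_path="Qwen/Qwen3-0.6B",  # tokenizer whose chat template to reuse
    ),
)
trainer.train()
```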
To turn a base model into an instruction-following model, provide a chat template and a conversational dataset.

Dataset packing
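The best-fit-decreasing idea behind `packing_strategy="bfd"` can be illustrated in pure Python (a simplified sketch, not the trainer’s actual implementation):

```python
def pack_bfd(lengths, block_size):
    """Greedy best-fit decreasing: place each sequence (longest first) into
    the fullest block that still has room; open a new block otherwise."""
    blocks = []  # each block is a list of sequence lengths
    for length in sorted(lengths, reverse=True):
        length = min(length, block_size)  # "bfd" truncates overflow
        # Best fit: the block with the least remaining room that still fits.
        best = None
        for block in blocks:
            room = block_size - sum(block)
            if length <= room and (best is None or room < block_size - sum(best)):
                best = block
        if best is None:
            blocks.append([length])
        else:
            best.append(length)
    return blocks

print(pack_bfd([6, 3, 3, 2], block_size=8))  # [[6, 2], [3, 3]]
```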
Packing is a technique to increase training efficiency by grouping multiple short examples into a single fixed-length block, reducing wasted padding tokens.

Training with PEFT/LoRA
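A minimal LoRA sketch, assuming the `peft` package is installed; the rank, alpha, and target modules below are illustrative choices, and the model and dataset names are placeholders:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # example model
    train_dataset=dataset,
    args=SFTConfig(output_dir="Qwen2.5-0.5B-SFT-LoRA"),
    # Only the LoRA adapter weights are trained; the base model stays frozen.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```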
Use the PEFT library to train only a small set of adapter parameters instead of the full model.

Training Vision-Language Models
SFTTrainer supports VLMs. Provide a dataset with an `image` column (a single image) or an `images` column (a list of images):
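A sketch with an example vision-language model and image dataset; both names are placeholders for any compatible pair:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Example conversational vision dataset with an "images" column.
dataset = load_dataset("trl-lib/llava-instruct-mix", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # example VLM
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="Qwen2.5-VL-3B-SFT",
        max_length=None,  # disable truncation, as recommended for VLMs
    ),
)
trainer.train()
```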
Logged metrics
| Metric | Description |
|---|---|
| `loss` | Average cross-entropy loss over non-masked tokens |
| `entropy` | Average entropy of the model’s predicted token distribution |
| `mean_token_accuracy` | Proportion of tokens where the top-1 prediction matches the ground truth |
| `learning_rate` | Current learning rate |
| `grad_norm` | L2 norm of the gradients, before clipping |
| `num_tokens` | Total number of tokens processed so far |