Once you have trained a model with SFTTrainer, DPOTrainer, GRPOTrainer, or any other TRL trainer, you can load it and run inference like any other Transformers model.

Load and generate

If you fine-tuned the model fully (without PEFT/LoRA), load it directly with the standard AutoModelForCausalLM class. Trainer-specific components, such as the value head used in PPO training, are ignored automatically:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "Qwen/Qwen3-0.6B"  # or path/to/your/model
device = "cuda"  # or "cpu"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

inputs = tokenizer.encode("This movie was really", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50)  # cap the number of generated tokens
print(tokenizer.decode(outputs[0]))
Alternatively, use the pipeline API:
from transformers import pipeline

model_name_or_path = "Qwen/Qwen3-0.6B"  # or path/to/your/model
pipe = pipeline("text-generation", model=model_name_or_path)
print(pipe("This movie was really")[0]["generated_text"])
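Both generate and the pipeline accept decoding parameters, and the defaults are fairly conservative. As a minimal sketch, you can collect a reusable set of keyword arguments and pass them to either call (the values below are illustrative, not tuned):

```python
# Illustrative decoding parameters; tune them for your model and task.
generation_kwargs = {
    "max_new_tokens": 64,  # upper bound on generated tokens
    "do_sample": True,     # sample instead of greedy decoding
    "temperature": 0.7,    # lower values are more deterministic
    "top_p": 0.9,          # nucleus sampling cutoff
}

# Pass them to either of the calls above, e.g.:
# outputs = model.generate(inputs, **generation_kwargs)
# print(pipe("This movie was really", **generation_kwargs))
```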

Load and use PEFT adapters

If you trained with LoRA or another PEFT method, load the base model and then apply the adapter on top:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "Qwen/Qwen3-0.6B"   # base model used during training
adapter_model_name = "path/to/my/adapter"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
With the adapter loaded, run generation as with a standard model.

Merge LoRA adapters into the base model

Merging adapters into the base model weights produces a single self-contained checkpoint that behaves exactly like a standard Transformers model — no PEFT dependency required at inference time.
Merged checkpoints are significantly larger than adapter-only checkpoints because they include all base model weights.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model_name = "Qwen/Qwen3-0.6B"
adapter_model_name = "path/to/my/adapter"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

# Merge adapter weights into the base model
model = model.merge_and_unload()
model.save_pretrained("merged_model")
After merging and saving, load the merged model as any other standard model:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("merged_model")
# Merging does not change the tokenizer; load it from the base model,
# or save it into "merged_model" with tokenizer.save_pretrained first.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

Push to the Hugging Face Hub

TRL trainers support pushing the trained model directly to the Hub at the end of training. Set push_to_hub=True in your training config:
from trl import SFTConfig

training_args = SFTConfig(
    ...,
    output_dir="my-model",
    push_to_hub=True,
)
Or push manually after training:
trainer.push_to_hub()
You can also use the standard Transformers API to push a loaded model:
model.push_to_hub("my-username/my-model")
tokenizer.push_to_hub("my-username/my-model")

Run an inference server

For production inference, consider running a dedicated inference server. Text Generation Inference (TGI) provides optimized serving for Transformers models, including models trained with TRL.
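As a sketch, TGI is typically launched through its official Docker image; the model ID, port, and cache volume below are placeholders to adapt to your setup:

```shell
# Launch Text Generation Inference for a model on the Hub or a local path.
model=Qwen/Qwen3-0.6B   # or my-username/my-model
volume=$PWD/data        # cache directory shared with the container

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model

# Once running, query it over HTTP, e.g.:
# curl 127.0.0.1:8080/generate -X POST \
#     -H 'Content-Type: application/json' \
#     -d '{"inputs": "This movie was really", "parameters": {"max_new_tokens": 20}}'
```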
