Once you have trained a model with SFTTrainer, DPOTrainer, GRPOTrainer, or any other TRL trainer, you can load it and run inference like any other Transformers model.

Load and generate

If you fine-tuned the model fully (without PEFT/LoRA), load it directly with the standard AutoModelForCausalLM class. Trainer-specific components, such as the value head used in PPO training, are ignored automatically:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "Qwen/Qwen3-0.6B"  # or path/to/your/model
device = "cuda"  # or "cpu"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

inputs = tokenizer.encode("This movie was really", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50)  # cap the number of generated tokens
print(tokenizer.decode(outputs[0]))
Alternatively, use the pipeline API:
from transformers import pipeline

model_name_or_path = "Qwen/Qwen3-0.6B"  # or path/to/your/model
pipe = pipeline("text-generation", model=model_name_or_path)
print(pipe("This movie was really")[0]["generated_text"])
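Both generate and the pipeline accept decoding parameters, and the defaults are fairly conservative. As a minimal sketch, you can collect a reusable set of keyword arguments and pass them to either call (the values below are illustrative, not tuned):

```python
# Illustrative decoding parameters; tune them for your model and task.
generation_kwargs = {
    "max_new_tokens": 64,  # upper bound on generated tokens
    "do_sample": True,     # sample instead of greedy decoding
    "temperature": 0.7,    # lower values are more deterministic
    "top_p": 0.9,          # nucleus sampling cutoff
}

# Pass them to either of the calls above, e.g.:
# outputs = model.generate(inputs, **generation_kwargs)
# print(pipe("This movie was really", **generation_kwargs))
```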

Load and use PEFT adapters

If you trained with LoRA or another PEFT method, load the base model and then apply the adapter on top:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "Qwen/Qwen3-0.6B"   # base model used during training
adapter_model_name = "path/to/my/adapter"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
With the adapter loaded, run generation as with a standard model.

Merge LoRA adapters into the base model

Merging adapters into the base model weights produces a single self-contained checkpoint that behaves exactly like a standard Transformers model — no PEFT dependency required at inference time.
Merged checkpoints are significantly larger than adapter-only checkpoints because they include all base model weights.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model_name = "Qwen/Qwen3-0.6B"
adapter_model_name = "path/to/my/adapter"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

# Merge adapter weights into the base model
model = model.merge_and_unload()
model.save_pretrained("merged_model")
After merging and saving, load the merged model as any other standard model:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("merged_model")
# Merging does not change the tokenizer; load it from the base model,
# or save it into "merged_model" with tokenizer.save_pretrained first.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

Push to the Hugging Face Hub

TRL trainers support pushing the trained model directly to the Hub at the end of training. Set push_to_hub=True in your training config:
from trl import SFTConfig

training_args = SFTConfig(
    ...,
    output_dir="my-model",
    push_to_hub=True,
)
Or push manually after training:
trainer.push_to_hub()
You can also use the standard Transformers API to push a loaded model:
model.push_to_hub("my-username/my-model")
tokenizer.push_to_hub("my-username/my-model")

Run an inference server

For production inference, consider running a dedicated inference server. Text Generation Inference (TGI) provides optimized serving for Transformers models, including models trained with TRL.
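As a sketch, TGI is typically launched through its official Docker image; the model ID, port, and cache volume below are placeholders to adapt to your setup:

```shell
# Launch Text Generation Inference for a model on the Hub or a local path.
model=Qwen/Qwen3-0.6B   # or my-username/my-model
volume=$PWD/data        # cache directory shared with the container

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model

# Once running, query it over HTTP, e.g.:
# curl 127.0.0.1:8080/generate -X POST \
#     -H 'Content-Type: application/json' \
#     -d '{"inputs": "This movie was really", "parameters": {"max_new_tokens": 20}}'
```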
