
Overview

Fine-tuning adapts pre-trained language models to your specific use case, improving performance on domain-specific tasks while using fewer resources than training from scratch.

Llama 3.2 Fine-tuning

Fine-tune 1B and 3B models with LoRA

Gemma 3 Fine-tuning

Fine-tune 270M to 27B models efficiently

Why Fine-Tune?

Task-Specific Accuracy

Fine-tuning improves performance on specialized tasks:
| Task | Base model | Fine-tuned | Improvement |
| --- | --- | --- | --- |
| Domain Q&A | 62% | 89% | +27% |
| Code generation | 45% | 78% | +33% |
| Instruction following | 71% | 94% | +23% |
| Custom format | 38% | 92% | +54% |

Parameter-Efficient Fine-Tuning

LoRA (Low-Rank Adaptation)

Fine-tune large models by training only 1-2% of parameters
LoRA adds small trainable matrices to existing model layers, dramatically reducing:
  • Training time: 3-5x faster
  • Memory usage: 3-4x less VRAM
  • Storage: Adapters are 10-100MB vs full model GBs
  • Cost: Train on free Google Colab

How LoRA Works

# Standard fine-tuning (updates ALL parameters)
W_new = W_original + ΔW  # ΔW is full matrix

# LoRA fine-tuning (low-rank decomposition)
W_new = W_original + A @ B  # A and B are small matrices

# Example:
# Original weight matrix: 4096 x 4096 = 16.8M parameters
# LoRA matrices: (4096 x 8) @ (8 x 4096) = 65K parameters
# Trainable params: 0.4% of original!
Key Parameters:
  • r (rank): Size of low-rank matrices (typically 8-64)
  • alpha: Scaling factor (typically 16-32)
  • target_modules: Which layers to adapt
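The arithmetic behind the example above is easy to verify. A quick plain-Python check of the parameter savings for a hypothetical rank-r adapter on a d_in x d_out weight matrix (the helper name is illustrative, not a library function):

```python
def lora_param_counts(d_in: int, d_out: int, r: int):
    """Compare full fine-tuning vs. a rank-r LoRA adapter."""
    full = d_in * d_out              # every weight trainable
    lora = d_in * r + r * d_out      # A (d_in x r) plus B (r x d_out)
    return full, lora, lora / full

full, lora, frac = lora_param_counts(4096, 4096, r=8)
print(f"Full: {full:,}  LoRA: {lora:,}  ({frac:.2%} of original)")
# Full: 16,777,216  LoRA: 65,536  (0.39% of original)
```

Note how the trainable fraction scales linearly with r: doubling the rank to 16 doubles adapter size but is still well under 1% of the full matrix.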

Llama 3.2 Fine-Tuning

Fine-tune Meta’s Llama 3.2 (1B or 3B) for free in Google Colab

Quick Start

1. Install Dependencies

pip install torch transformers datasets trl unsloth
2. Run Fine-Tuning Script

python finetune_llama3.2.py
3. Use Fine-Tuned Model

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "finetuned_model"
)
FastLanguageModel.for_inference(model)  # enable inference mode

inputs = tokenizer("Your prompt here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Complete Implementation

import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# 1. Load model and tokenizer with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # Reduces memory by 75%
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

# 3. Prepare dataset
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

# Apply chat template
dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
    },
    batched=True
)

# 4. Configure trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # Increase for production
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs",
    ),
)

# 5. Train
trainer.train()

# 6. Save fine-tuned model
model.save_pretrained("finetuned_model")
tokenizer.save_pretrained("finetuned_model")

Model Selection

Llama 3.2 1B Instruct

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
Specs:
  • Parameters: 1.2B
  • Context: 128K tokens
  • VRAM (4-bit): ~2GB
  • Training time: ~20 min (100 steps, Colab T4)
  • Use case: Fast, lightweight tasks

Training Configuration

TrainingArguments (class): configures the fine-tuning process.

LoRA Configuration

get_peft_model (function): adds LoRA adapters to the model.

Dataset Preparation

The standard format for instruction fine-tuning:
[
  {
    "conversations": [
      {"role": "user", "content": "What is machine learning?"},
      {"role": "assistant", "content": "Machine learning is..."}
    ]
  },
  {
    "conversations": [
      {"role": "user", "content": "Explain neural networks"},
      {"role": "assistant", "content": "Neural networks are..."}
    ]
  }
]
Multi-turn conversations:
{
  "conversations": [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Tell me about AI"},
    {"role": "assistant", "content": "AI refers to..."}
  ]
}
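If your raw data is flat prompt/response pairs rather than this format, a small converter does the job. A minimal sketch (the to_conversations helper and its input are hypothetical, not part of any library):

```python
def to_conversations(pairs):
    """Convert flat (prompt, response) tuples into the
    conversations format shown above."""
    return [
        {
            "conversations": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]
        }
        for prompt, response in pairs
    ]

data = to_conversations([("What is ML?", "Machine learning is...")])
print(data[0]["conversations"][0]["role"])  # user
```

The resulting list can be passed directly to Dataset.from_list, as in the loading example below.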
from datasets import Dataset
import json

# Load your data
with open("my_data.json") as f:
    data = json.load(f)

# Convert to Hugging Face dataset
dataset = Dataset.from_list(data)

# Standardize format
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)

# Apply chat template
dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
    },
    batched=True
)

# Use in training
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
)

Google Colab Setup

1. Open Colab Notebook

2. Select GPU Runtime

Runtime → Change runtime type → T4 GPU (free tier)
3. Install Dependencies

!pip install torch transformers datasets trl unsloth
4. Run Fine-Tuning

Copy the finetune_llama3.2.py code into a cell and run
5. Save to Google Drive

from google.colab import drive
drive.mount('/content/drive')

# Save model
model.save_pretrained("/content/drive/MyDrive/finetuned_model")

Gemma 3 Fine-Tuning

Fine-tune Google’s Gemma 3 models from 270M to 27B parameters

Model Sizes

  • 270M: Ultra-lightweight (VRAM ~1GB)
  • 1B: Fast & efficient (VRAM ~2GB)
  • 4B: Balanced (VRAM ~6GB)
  • 12B: High quality (VRAM ~16GB)
  • 27B: Best performance (VRAM ~32GB)

Implementation

import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Model selection
MODEL_NAME = "unsloth/gemma-3-1b-it"  # Change for different sizes
# Options: gemma-3-270m-it, gemma-3-1b-it, gemma-3-4b-it, 
#          gemma-3-12b-it, gemma-3-27b-it

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

# Prepare dataset
tokenizer = get_chat_template(tokenizer, chat_template="gemma")
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
    },
    batched=True
)

# Configure trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs",
    ),
)

# Train
trainer.train()

# Save
model.save_pretrained("finetuned_model")

Gemma-Specific Notes

Gemma uses a specific chat format:
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma"  # Important!
)
The template formats messages as:
<start_of_turn>user
Your question<end_of_turn>
<start_of_turn>model
Response<end_of_turn>
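For intuition, the turn structure can be reproduced in a few lines of plain Python. This is a sketch of the template's output shape, not the tokenizer's actual implementation (which also handles special tokens such as the beginning-of-sequence marker):

```python
def format_gemma_turns(messages):
    """Render messages in Gemma's turn format (sketch only)."""
    # Gemma uses "model" rather than "assistant" for its own turns
    role_map = {"user": "user", "assistant": "model"}
    out = ""
    for m in messages:
        role = role_map.get(m["role"], m["role"])
        out += f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n"
    return out

print(format_gemma_turns([
    {"role": "user", "content": "Your question"},
    {"role": "assistant", "content": "Response"},
]))
```

In practice, always let tokenizer.apply_chat_template do this for you; the sketch just shows why using the wrong template (e.g. Llama's) silently degrades output quality.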
Approximate VRAM needs by quantization level:

| Model | 4-bit | 8-bit | Full (16-bit) |
| --- | --- | --- | --- |
| 270M | 0.5GB | 0.8GB | 1.5GB |
| 1B | 1.5GB | 2.5GB | 4.5GB |
| 4B | 5GB | 9GB | 16GB |
| 12B | 14GB | 25GB | 48GB |
| 27B | 30GB | 55GB | 108GB |
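The raw weight footprint behind these figures is simple arithmetic: parameters x bits / 8. Runtime VRAM is higher because of activations, KV cache, and framework overhead, so treat this estimator as a lower bound:

```python
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Raw weight storage only; actual runtime VRAM is higher."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

# 27B in 4-bit: 13.5GB of weights alone, which is why the
# table lists ~30GB once runtime overhead is included
print(round(weight_memory_gb(27, 4), 1))   # 13.5
print(round(weight_memory_gb(27, 16), 1))  # 54.0
```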
Choose based on use case:
  • 270M: Edge devices, real-time inference, simple tasks
  • 1B: Chatbots, content moderation, classification
  • 4B: Code generation, summarization, Q&A (Colab-friendly)
  • 12B: Complex reasoning, multi-turn conversations
  • 27B: Production apps requiring GPT-3.5 level quality

Advanced Topics

Evaluation During Training

from datasets import load_dataset

# FineTome-100k ships only a "train" split, so carve out a validation set
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
split = dataset.train_test_split(test_size=0.1, seed=42)
train_data = split["train"]
eval_data = split["test"]

# Configure trainer with evaluation
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,  # Add validation set
    args=TrainingArguments(
        # ... other args
        eval_strategy="steps",  # named evaluation_strategy in older transformers
        eval_steps=50,  # Evaluate every 50 steps
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    ),
)

Inference After Fine-Tuning

from unsloth import FastLanguageModel

# Load fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    "finetuned_model",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Prepare for inference
FastLanguageModel.for_inference(model)

# Generate
messages = [
    {"role": "user", "content": "What is quantum computing?"}
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Merging LoRA with Base Model

from unsloth import FastLanguageModel

# Load model with LoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    "finetuned_model",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Merge LoRA weights into the base model and save in 16-bit.
# Merging directly from 4-bit weights is lossy, so use Unsloth's
# save_pretrained_merged, which handles dequantization for you.
model.save_pretrained_merged(
    "merged_model", tokenizer, save_method="merged_16bit"
)

# Upload to Hugging Face Hub (optional)
model.push_to_hub_merged(
    "your-username/model-name", tokenizer, save_method="merged_16bit"
)

Quantization Options

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    load_in_4bit=True,  # 75% memory reduction
)
  • VRAM: 25% of full model
  • Speed: 90% of full model
  • Quality: 98-99% of full model
  • Best for: Most use cases

Best Practices

High-quality data is more important than quantity:
  • ✅ 1,000 high-quality examples > 10,000 mediocre examples
  • ✅ Clean, consistent formatting
  • ✅ Diverse range of topics/styles
  • ✅ Representative of actual use case
  • ❌ Avoid duplicate or near-duplicate examples
  • ❌ Don’t include low-quality or incorrect data
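One practical way to enforce the no-duplicates rule is to hash a normalized form of each example and drop repeats. A minimal sketch that catches exact duplicates only (near-duplicates need fuzzier techniques such as MinHash or embedding similarity):

```python
import hashlib

def dedupe_examples(examples):
    """Drop exact duplicates after normalizing whitespace and case."""
    seen, unique = set(), []
    for ex in examples:
        normalized = " ".join(ex.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

# "What is ML?" and "what  is ml?" normalize to the same key
print(len(dedupe_examples(["What is ML?", "what  is ml?", "Explain AI"])))  # 2
```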
Don’t overtrain:
# Start small
max_steps=100  # initial run

# Monitor eval loss (TrainingArguments)
eval_strategy="steps"
eval_steps=20

# Early stopping: load_best_model_at_end goes in TrainingArguments,
# but the patience setting belongs to a callback, not TrainingArguments:
# trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
load_best_model_at_end=True
Typical steps by dataset size:
  • 1K examples: 100-300 steps
  • 10K examples: 300-1000 steps
  • 100K examples: 1000-3000 steps
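To relate steps to data coverage: each optimizer step consumes per_device_train_batch_size x gradient_accumulation_steps examples. With this guide's defaults (batch size 2, accumulation 4):

```python
def examples_seen(steps: int, batch_size: int = 2, grad_accum: int = 4) -> int:
    """Examples consumed after `steps` optimizer steps."""
    return steps * batch_size * grad_accum

print(examples_seen(60))    # 480
print(examples_seen(1000))  # 8000
```

This is why max_steps=60 is only a smoke test: it touches 480 of FineTome's 100K examples, well under one epoch.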
Start with defaults, then adjust:
| Parameter | Default | Increase if… | Decrease if… |
| --- | --- | --- | --- |
| learning_rate | 2e-4 | Loss plateaus | Loss explodes |
| r (rank) | 16 | Need better quality | Out of memory |
| max_steps | 60 | Underfitting | Overfitting |
| batch_size | 2 | Have more VRAM | Out of memory |
Evaluate thoroughly:
# Create test prompts
test_prompts = [
    "Easy question",
    "Medium question",
    "Hard question",
    "Edge case",
]

# Test before and after
for prompt in test_prompts:
    base_response = base_model.generate(prompt)
    finetuned_response = finetuned_model.generate(prompt)
    
    print(f"Prompt: {prompt}")
    print(f"Base: {base_response}")
    print(f"Fine-tuned: {finetuned_response}")
    print("---")

Troubleshooting

Out of memory errors? Solutions:
  1. Reduce batch size:
per_device_train_batch_size=1
gradient_accumulation_steps=8
  2. Use 4-bit quantization:
load_in_4bit=True
  3. Reduce sequence length:
max_seq_length=1024  # instead of 2048
  4. Use a smaller LoRA rank:
r=8  # instead of 16
Poor output quality? Possible causes:
  1. Too few training steps: increase max_steps
  2. Low-quality data: clean and curate the dataset
  3. Wrong chat template: use the correct template for the model
  4. Learning rate too high: reduce to 1e-4 or 5e-5
  5. Overfitting: add a validation set, reduce steps
Training too slow? Speed it up:
  1. Use a smaller model (1B instead of 3B)
  2. Reduce sequence length
  3. Use gradient checkpointing:
gradient_checkpointing=True
  4. Ensure CUDA is used:
print(torch.cuda.is_available())  # Should be True

Resources

Unsloth GitHub

Fast fine-tuning library

Gemma 3 Blog

Gemma 3 fine-tuning guide

Example Code

Complete fine-tuning scripts

Tutorial

Step-by-step fine-tuning tutorial
