Finetuning a pretrained GPT-2 model is identical to training from scratch, except you initialize from a checkpoint and use a smaller learning rate. This allows you to adapt large models to your specific domain with minimal compute.

Quick start

Finetune GPT-2 XL (1.5B parameters) on Shakespeare in just a few minutes:
Step 1: Prepare the dataset

Download and tokenize Shakespeare using GPT-2 BPE:
python data/shakespeare/prepare.py
This creates train.bin and val.bin using the OpenAI GPT-2 BPE tokenizer, rather than a character-level encoding.
Step 2: Run finetuning

python train.py config/finetune_shakespeare.py
This loads the pretrained gpt2-xl checkpoint and finetunes it on Shakespeare.
Step 3: Sample from the model

python sample.py --out_dir=out-shakespeare

Finetuning configuration

The config/finetune_shakespeare.py file shows a complete finetuning setup:
import time

out_dir = 'out-shakespeare'
eval_interval = 5
eval_iters = 40
wandb_log = False

dataset = 'shakespeare'
init_from = 'gpt2-xl'  # Largest GPT-2 model (1558M params)

# Only save checkpoints if validation loss improves
always_save_checkpoint = False

# Batch size calculation:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# Shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# Finetune at constant learning rate
learning_rate = 3e-5
decay_lr = False

Key differences from pretraining

| Parameter | Pretraining | Finetuning |
| --- | --- | --- |
| init_from | 'scratch' | 'gpt2-xl' |
| learning_rate | 6e-4 | 3e-5 (10-20x lower) |
| decay_lr | True | False (constant LR) |
| max_iters | 600000 | 20 (much shorter) |
| dropout | 0.0 | 0.1+ (optional regularization) |

Available pretrained models

You can initialize from any OpenAI GPT-2 checkpoint:
| Model | Parameters | Context Length | Memory (fp32 weights) |
| --- | --- | --- | --- |
| gpt2 | 124M | 1024 | ~500MB |
| gpt2-medium | 350M | 1024 | ~1.5GB |
| gpt2-large | 774M | 1024 | ~3GB |
| gpt2-xl | 1558M | 1024 | ~6GB |
Set the model in your config:
init_from = 'gpt2-xl'  # or 'gpt2', 'gpt2-medium', 'gpt2-large'
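The memory figures above are easy to sanity-check. A back-of-the-envelope sketch (parameter counts from the table; weights only, ignoring activations, gradients, and optimizer state, which add substantially more during training):

```python
# Approximate parameter counts for the OpenAI GPT-2 checkpoints
PARAMS = {
    'gpt2': 124e6,
    'gpt2-medium': 350e6,
    'gpt2-large': 774e6,
    'gpt2-xl': 1558e6,
}

def weight_memory_gb(model, bytes_per_param=4):
    # fp32 = 4 bytes/param; fp16/bf16 = 2 bytes/param
    return PARAMS[model] * bytes_per_param / 1e9

for name in PARAMS:
    print(f"{name}: ~{weight_memory_gb(name):.1f} GB fp32, "
          f"~{weight_memory_gb(name, 2):.1f} GB fp16")
```

Halving bytes per parameter (fp16/bf16) halves the weight footprint, which is why mixed precision helps so much with the larger checkpoints.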

Initialization logic

From train.py:181-189, the script downloads and loads pretrained weights:
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    
    # Initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    
    # Read config params for checkpointing
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)
This automatically downloads weights from Hugging Face on first use.

Preparing custom datasets

Using BPE tokenization

For best results, use the same GPT-2 BPE tokenizer as pretraining:
import tiktoken
import numpy as np

# Load GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")

# Tokenize your text
with open('your_data.txt', 'r', encoding='utf-8') as f:
    text = f.read()

ids = enc.encode_ordinary(text)
ids.append(enc.eot_token)  # Add end-of-text token

# Save as binary file
arr = np.array(ids, dtype=np.uint16)
arr.tofile('train.bin')

Train/validation split

Create separate files:
# Split data 90/10
split_idx = int(len(ids) * 0.9)

train_ids = ids[:split_idx]
val_ids = ids[split_idx:]

np.array(train_ids, dtype=np.uint16).tofile('train.bin')
np.array(val_ids, dtype=np.uint16).tofile('val.bin')
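train.py reads these .bin files back with np.memmap and samples random windows from them. A NumPy-only sketch of that batching (the real get_batch returns torch tensors and moves them to the training device):

```python
import numpy as np

def get_batch(path, batch_size, block_size, seed=0):
    # Memory-map the file so large datasets never load fully into RAM
    data = np.memmap(path, dtype=np.uint16, mode='r')
    rng = np.random.default_rng(seed)
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    # Targets are the input window shifted by one token
    x = np.stack([data[i:i + block_size].astype(np.int64) for i in ix])
    y = np.stack([data[i + 1:i + 1 + block_size].astype(np.int64) for i in ix])
    return x, y

# Tiny demo: a "dataset" of sequential token ids
np.arange(1000, dtype=np.uint16).tofile('train.bin')
x, y = get_batch('train.bin', batch_size=4, block_size=8)
```

Because the data is memory-mapped, dataset size is bounded by disk rather than RAM, which is what makes the same pipeline work for both Shakespeare and multi-gigabyte corpora.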

Directory structure

data/
└── your_dataset/
    ├── prepare.py       # Tokenization script
    ├── train.bin        # Training data
    ├── val.bin          # Validation data
    └── meta.pkl         # Vocabulary metadata (optional)

Finetuning hyperparameters

Learning rate

Use a much lower learning rate than pretraining:
learning_rate = 3e-5  # Typical range: 1e-5 to 1e-4
decay_lr = False       # Often use constant LR
Too high a learning rate can destroy pretrained knowledge. Start conservatively and increase if training is too slow.
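If early steps are still unstable at a constant learning rate, nanoGPT's scheduler options can add a short warmup with decay instead. A sketch using the standard config knobs (values here are illustrative, sized for the 20-iteration Shakespeare run):

```python
# Alternative: warmup + cosine decay instead of a constant LR
decay_lr = True
warmup_iters = 10     # brief warmup to avoid an initial loss spike
lr_decay_iters = 20   # decay over the full run (typically set to max_iters)
min_lr = 3e-6         # floor at ~1/10 of the peak learning rate
```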

Regularization

Add dropout to prevent overfitting on small datasets:
dropout = 0.1  # 0.1 to 0.2 for finetuning (0.0 for pretraining)
From the override logic:
override_args = dict(dropout=dropout)
model = GPT.from_pretrained(init_from, override_args)

Training duration

Finetuning requires far fewer iterations:
max_iters = 20  # For tiny datasets like Shakespeare
# Scale up for larger datasets:
# - Small dataset (1M tokens): 100-500 iters
# - Medium dataset (100M tokens): 1000-5000 iters
# - Large dataset (1B+ tokens): 10000+ iters
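Translating dataset size into iterations is just a matter of dividing the token count by tokens per iteration; a quick helper (the Shakespeare figure matches the config comment above):

```python
def iters_per_epoch(dataset_tokens, batch_size=1, grad_accum=32, block_size=1024):
    # Tokens consumed per optimizer step
    tokens_per_iter = batch_size * grad_accum * block_size
    return dataset_tokens / tokens_per_iter

print(iters_per_epoch(301_966))  # Shakespeare: ~9.2 iters per epoch
```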

Batch size

Adjust based on dataset size and memory:
# Small dataset (Shakespeare: 300K tokens)
batch_size = 1
gradient_accumulation_steps = 32
# Effective batch size: 1 * 32 * 1024 = 32,768 tokens

# Larger dataset
batch_size = 4
gradient_accumulation_steps = 8
# Effective batch size: 4 * 8 * 1024 = 32,768 tokens
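When trading batch_size against gradient_accumulation_steps, keep their product constant so tokens per iteration (and thus training dynamics) stay the same. A small helper (hypothetical, not part of nanoGPT) that solves for the accumulation steps:

```python
def grad_accum_for(target_tokens_per_iter, batch_size, block_size=1024):
    # tokens/iter = batch_size * grad_accum * block_size
    tokens_per_micro_batch = batch_size * block_size
    assert target_tokens_per_iter % tokens_per_micro_batch == 0
    return target_tokens_per_iter // tokens_per_micro_batch

print(grad_accum_for(32_768, batch_size=1))  # 32, as in the Shakespeare config
print(grad_accum_for(32_768, batch_size=4))  # 8
```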

Example finetuning output

After finetuning GPT-2 XL on Shakespeare:
THEODORE:
Thou shalt sell me to the highest bidder: if I die,
I sell thee to the first; if I go mad,
I sell thee to the second; if I
lie, I sell thee to the third; if I slay,
I sell thee to the fourth: so buy or sell,
I tell thee again, thou shalt not sell my
possession.

JULIET:
And if thou steal, thou shalt not sell thyself.

THEODORE:
I do not steal; I sell the stolen goods.

THEODORE:
Thou know'st not what thou sell'st; thou, a woman,
Thou art ever a victim, a thing of no worth:
Thou hast no right, no right, but to be sold.
The model generates coherent Shakespearean-style dialogue with proper formatting and character names after just ~20 iterations.

Memory management

Reduce memory usage

If you run out of GPU memory:
init_from = 'gpt2'  # 124M params instead of gpt2-xl (1558M)
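Beyond choosing a smaller checkpoint, the standard config knobs below also cut memory (a sketch; note that a shorter block_size reduces the context the model sees per example):

```python
# Memory-saving config options (trade throughput or context for memory)
batch_size = 1                    # smallest micro-batch per forward pass
gradient_accumulation_steps = 32  # recover the effective batch size
block_size = 512                  # shorter context than the 1024 default
```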

Multi-GPU finetuning

Scale finetuning across multiple GPUs:
torchrun --standalone --nproc_per_node=4 train.py config/finetune_shakespeare.py
The same DDP logic applies as in pretraining. See Distributed Training for details.

Resume from checkpoint

Resume finetuning from a saved checkpoint:
init_from = 'resume'
out_dir = 'out-shakespeare'  # Directory with ckpt.pt
Run:
python train.py config/finetune_shakespeare.py --init_from=resume
From train.py:158-180:
elif init_from == 'resume':
    print(f"Resuming training from {out_dir}")
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
    
    # Restore model architecture
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    model.load_state_dict(checkpoint['model'])
    
    # Restore training state
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']

Domain adaptation strategies

Continue pretraining

For large domain-specific corpora (e.g., medical, legal, code):
init_from = 'gpt2-xl'
learning_rate = 1e-4  # Slightly higher than pure finetuning
max_iters = 10000      # More iterations
decay_lr = True        # Use learning rate decay

Task-specific finetuning

For specific tasks (e.g., summarization, Q&A):
init_from = 'gpt2-xl'
learning_rate = 3e-5   # Low learning rate
max_iters = 1000       # Moderate iterations
dropout = 0.1          # Some regularization

Few-shot finetuning

For very small datasets (less than 1000 examples):
init_from = 'gpt2-xl'
learning_rate = 1e-5   # Very low learning rate
max_iters = 50         # Few iterations
dropout = 0.2          # Higher regularization
always_save_checkpoint = False  # Only save improvements

Monitoring overfitting

Watch for signs of overfitting:
step 0: train loss 3.2341, val loss 3.2156
step 5: train loss 0.8765, val loss 1.2341  # Val loss not improving
step 10: train loss 0.3421, val loss 1.3567  # Val loss increasing
If validation loss increases while training loss decreases, you’re overfitting. Reduce max_iters, increase dropout, or use a smaller model.
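The check above can be automated with a simple patience rule (a hypothetical helper, not part of train.py; loss values mirror the log excerpt):

```python
def should_stop(val_losses, patience=2):
    # Stop once validation loss hasn't improved for `patience` evals
    best_idx = val_losses.index(min(val_losses))
    return (len(val_losses) - 1 - best_idx) >= patience

print(should_stop([3.2156, 1.2341, 1.3567]))          # False: best was 1 eval ago
print(should_stop([3.2156, 1.2341, 1.3567, 1.4102]))  # True: 2 evals without improvement
```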

Best practices

Start small

Begin with gpt2 (124M) to iterate quickly, then scale to larger models

Monitor validation loss

Only save checkpoints when validation loss improves

Use BPE tokenization

Match the pretraining tokenizer (GPT-2 BPE) for best results

Tune learning rate

Start at 3e-5, decrease if unstable, increase if too slow

Next steps

Sampling

Generate text from your finetuned model

Distributed training

Scale finetuning across multiple GPUs
