Finetuning a pretrained GPT-2 model is identical to training from scratch, except you initialize from a checkpoint and use a smaller learning rate. This allows you to adapt large models to your specific domain with minimal compute.
Quick start
Finetune GPT-2 XL (1.5B parameters) on Shakespeare in just a few minutes:
Prepare the dataset
Download and tokenize Shakespeare with the GPT-2 BPE tokenizer:

```bash
python data/shakespeare/prepare.py
```
This creates `train.bin` and `val.bin` encoded with the OpenAI BPE tokenizer (rather than character-level encoding).
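The `.bin` format is just a flat, headerless array of uint16 token IDs, so it can be inspected directly with NumPy. A minimal sketch using a few stand-in token IDs:

```python
import numpy as np

# stand-in for a prepared .bin: a flat array of uint16 GPT-2 token IDs
ids = np.array([464, 1893, 11, 50256], dtype=np.uint16)
ids.tofile('train.bin')

# read it back the way train.py does, via a memory map
data = np.memmap('train.bin', dtype=np.uint16, mode='r')
print(len(data), int(data[-1]))  # token count, final token (50256 = <|endoftext|>)
```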
Run finetuning
```bash
python train.py config/finetune_shakespeare.py
```
This loads the pretrained gpt2-xl checkpoint and finetunes it on Shakespeare.
Sample from the model
```bash
python sample.py --out_dir=out-shakespeare
```
Finetuning configuration
The config/finetune_shakespeare.py file shows a complete finetuning setup:
```python
import time

out_dir = 'out-shakespeare'
eval_interval = 5
eval_iters = 40
wandb_log = False

dataset = 'shakespeare'
init_from = 'gpt2-xl'  # largest GPT-2 model (1558M params)

# only save checkpoints if validation loss improves
always_save_checkpoint = False

# batch size calculation:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# Shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# finetune at constant learning rate
learning_rate = 3e-5
decay_lr = False
```
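The epoch arithmetic in the comments above can be checked directly:

```python
batch_size = 1
gradient_accumulation_steps = 32
block_size = 1024

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
print(tokens_per_iter)  # 32768

dataset_tokens = 301_966  # Shakespeare token count
print(dataset_tokens / tokens_per_iter)       # ~9.2 iters per epoch
print(20 * tokens_per_iter / dataset_tokens)  # max_iters = 20 is ~2.2 epochs
```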
Key differences from pretraining
| Parameter | Pretraining | Finetuning |
|-----------|-------------|------------|
| `init_from` | `'scratch'` | `'gpt2-xl'` |
| `learning_rate` | `6e-4` | `3e-5` (10-20x lower) |
| `decay_lr` | `True` | `False` (constant LR) |
| `max_iters` | `600000` | `20` (much shorter) |
| `dropout` | `0.0` | `0.1`+ (optional regularization) |
Available pretrained models
You can initialize from any OpenAI GPT-2 checkpoint:
| Model | Parameters | Context Length | Memory (fp16) |
|-------|------------|----------------|---------------|
| `gpt2` | 124M | 1024 | ~500MB |
| `gpt2-medium` | 350M | 1024 | ~1.5GB |
| `gpt2-large` | 774M | 1024 | ~3GB |
| `gpt2-xl` | 1558M | 1024 | ~6GB |
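The parameter counts follow from the architecture. A rough estimate, ignoring biases and LayerNorms (the `n_layer`/`n_embd` values are the published GPT-2 configurations):

```python
def approx_params(n_layer, n_embd, vocab_size=50257, block_size=1024):
    # per block: ~4*n_embd^2 for attention (QKV + output projection)
    # plus ~8*n_embd^2 for the 4x MLP, i.e. 12*n_embd^2 total
    blocks = 12 * n_layer * n_embd ** 2
    embeddings = vocab_size * n_embd + block_size * n_embd
    return blocks + embeddings

print(round(approx_params(12, 768) / 1e6))   # gpt2: ~124M
print(round(approx_params(48, 1600) / 1e6))  # gpt2-xl: ~1557M
```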
Set the model in your config:
```python
init_from = 'gpt2-xl'  # or 'gpt2', 'gpt2-medium', 'gpt2-large'
```
Initialization logic
From train.py:181-189, the script downloads and loads pretrained weights:
```python
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    # initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    # read config params for checkpointing
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)
```
This automatically downloads weights from Hugging Face on first use.
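Inside `from_pretrained` the checkpoint name selects the architecture; the mapping looks roughly like this:

```python
# architecture hyperparameters selected by checkpoint name
config_args = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}

print(config_args['gpt2-xl'])
```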
Preparing custom datasets
Using BPE tokenization
For best results, use the same GPT-2 BPE tokenizer as pretraining:
```python
import tiktoken
import numpy as np

# load the GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")

# tokenize your text
with open('your_data.txt', 'r') as f:
    text = f.read()
ids = enc.encode_ordinary(text)
ids.append(enc.eot_token)  # add end-of-text token

# save as a binary file of uint16 token IDs
arr = np.array(ids, dtype=np.uint16)
arr.tofile('train.bin')
```
Train/validation split
Create separate files:
```python
# split data 90/10
split_idx = int(len(ids) * 0.9)
train_ids = ids[:split_idx]
val_ids = ids[split_idx:]

np.array(train_ids, dtype=np.uint16).tofile('train.bin')
np.array(val_ids, dtype=np.uint16).tofile('val.bin')
```
Directory structure
```
data/
└── your_dataset/
    ├── prepare.py   # tokenization script
    ├── train.bin    # training data
    ├── val.bin      # validation data
    └── meta.pkl     # vocabulary metadata (optional)
```
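train.py memory-maps these bins and samples random context windows from them. A simplified sketch of the batching logic, using a stand-in `train.bin` of sequential IDs:

```python
import numpy as np

# stand-in train.bin: 1000 sequential token IDs
np.arange(1000, dtype=np.uint16).tofile('train.bin')

block_size, batch_size = 8, 4
data = np.memmap('train.bin', dtype=np.uint16, mode='r')

ix = np.random.randint(len(data) - block_size, size=batch_size)
x = np.stack([data[i:i + block_size] for i in ix])          # inputs
y = np.stack([data[i + 1:i + 1 + block_size] for i in ix])  # targets, shifted by one
print(x.shape, y.shape)  # (4, 8) (4, 8)
```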
Finetuning hyperparameters
Learning rate
Use a much lower learning rate than pretraining:
```python
learning_rate = 3e-5  # typical range: 1e-5 to 1e-4
decay_lr = False      # often a constant LR works best
```
Too high a learning rate can destroy pretrained knowledge. Start conservatively and increase if training is too slow.
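With `decay_lr = False`, nanoGPT's schedule simply returns the flat `learning_rate` every iteration; the cosine branch only matters if you re-enable decay. A simplified sketch of the schedule (the finetuning-scale defaults here are illustrative):

```python
import math

def get_lr(it, learning_rate=3e-5, decay_lr=False,
           warmup_iters=0, lr_decay_iters=20, min_lr=3e-6):
    # simplified version of the schedule in train.py
    if not decay_lr:
        return learning_rate  # constant LR: the common finetuning choice
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * min(ratio, 1.0)))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(10))                 # 3e-05 (constant)
print(get_lr(20, decay_lr=True))  # fully decayed to min_lr
```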
Regularization
Add dropout to prevent overfitting on small datasets:
```python
dropout = 0.1  # 0.1 to 0.2 for finetuning (0.0 for pretraining)
```
From the override logic:
```python
override_args = dict(dropout=dropout)
model = GPT.from_pretrained(init_from, override_args)
```
Training duration
Finetuning requires far fewer iterations:
```python
max_iters = 20  # for tiny datasets like Shakespeare

# scale up for larger datasets:
# - small dataset (1M tokens): 100-500 iters
# - medium dataset (100M tokens): 1000-5000 iters
# - large dataset (1B+ tokens): 10000+ iters
```
Batch size
Adjust based on dataset size and memory:
```python
# small dataset (Shakespeare: ~300K tokens)
batch_size = 1
gradient_accumulation_steps = 32
# effective batch size: 1 * 32 * 1024 = 32,768 tokens

# larger dataset
batch_size = 4
gradient_accumulation_steps = 8
# effective batch size: 4 * 8 * 1024 = 32,768 tokens
```
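Gradient accumulation is what keeps the effective batch size constant across these configurations: each micro-batch contributes `loss / gradient_accumulation_steps` before a single optimizer step. A minimal toy sketch of the loop (tiny linear model and random data are stand-ins, not nanoGPT's actual training loop):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=3e-5)
grad_accum = 8

opt.zero_grad()
for micro_step in range(grad_accum):
    x = torch.randn(2, 4)           # one micro-batch
    loss = model(x).pow(2).mean()
    (loss / grad_accum).backward()  # scale so gradients average over micro-batches
opt.step()                          # one update for the whole effective batch
```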
Example finetuning output
After finetuning GPT-2 XL on Shakespeare:
```
THEODORE:
Thou shalt sell me to the highest bidder: if I die,
I sell thee to the first; if I go mad,
I sell thee to the second; if I
lie, I sell thee to the third; if I slay,
I sell thee to the fourth: so buy or sell,
I tell thee again, thou shalt not sell my
possession.

JULIET:
And if thou steal, thou shalt not sell thyself.

THEODORE:
I do not steal; I sell the stolen goods.

THEODORE:
Thou know'st not what thou sell'st; thou, a woman,
Thou art ever a victim, a thing of no worth:
Thou hast no right, no right, but to be sold.
```
The model generates coherent Shakespearean-style dialogue with proper formatting and character names after just ~20 iterations.
Memory management
Reduce memory usage
If you run out of GPU memory:

- Use a smaller model
- Decrease the batch size
- Decrease the context length

```python
init_from = 'gpt2'  # 124M params instead of gpt2-xl (1558M)
batch_size = 1
gradient_accumulation_steps = 16  # reduce from 32
block_size = 512  # reduce from 1024
```
Setting a smaller `block_size` crops the pretrained model's context window (from train.py):

```python
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size
```
Multi-GPU finetuning
Scale finetuning across multiple GPUs:
```bash
torchrun --standalone --nproc_per_node=4 train.py config/finetune_shakespeare.py
```
The same DDP logic applies as in pretraining. See Distributed Training for details.
Resume from checkpoint
Resume finetuning from a saved checkpoint:
```python
init_from = 'resume'
out_dir = 'out-shakespeare'  # directory containing ckpt.pt
```
Run:
```bash
python train.py config/finetune_shakespeare.py --init_from=resume
```
From train.py:158-180:
```python
elif init_from == 'resume':
    print(f"Resuming training from {out_dir}")
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
    # restore model architecture
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    model.load_state_dict(checkpoint['model'])
    # restore training state
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']
```
Domain adaptation strategies
Continue pretraining
For large domain-specific corpora (e.g., medical, legal, code):
```python
init_from = 'gpt2-xl'
learning_rate = 1e-4  # slightly higher than pure finetuning
max_iters = 10000     # more iterations
decay_lr = True       # use learning rate decay
```
Task-specific finetuning
For specific tasks (e.g., summarization, Q&A):
```python
init_from = 'gpt2-xl'
learning_rate = 3e-5  # low learning rate
max_iters = 1000      # moderate iterations
dropout = 0.1         # some regularization
```
Few-shot finetuning
For very small datasets (less than 1000 examples):
```python
init_from = 'gpt2-xl'
learning_rate = 1e-5  # very low learning rate
max_iters = 50        # few iterations
dropout = 0.2         # higher regularization
always_save_checkpoint = False  # only save improvements
```
Monitoring overfitting
Watch for signs of overfitting:
```
step 0:  train loss 3.2341, val loss 3.2156
step 5:  train loss 0.8765, val loss 1.2341   <- train/val gap widening
step 10: train loss 0.3421, val loss 1.3567   <- val loss increasing
```
If validation loss increases while training loss decreases, you’re overfitting. Reduce max_iters, increase dropout, or use a smaller model.
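The `always_save_checkpoint = False` guard in train.py encodes this rule: a checkpoint is only written when validation loss hits a new best. A simplified sketch using the losses from the log above:

```python
best_val_loss = float('inf')
always_save_checkpoint = False
saved_steps = []

for step, val_loss in [(0, 3.2156), (5, 1.2341), (10, 1.3567)]:
    if val_loss < best_val_loss or always_save_checkpoint:
        best_val_loss = val_loss
        saved_steps.append(step)  # train.py would write ckpt.pt here

print(saved_steps)  # [0, 5] -- step 10 regressed, so no checkpoint
```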
Best practices
- **Start small**: Begin with `gpt2` (124M) to iterate quickly, then scale to larger models
- **Monitor validation loss**: Only save checkpoints when validation loss improves
- **Use BPE tokenization**: Match the pretraining tokenizer (GPT-2 BPE) for best results
- **Tune the learning rate**: Start at 3e-5; decrease if training is unstable, increase if it's too slow
Next steps
- **Sampling**: Generate text from your finetuned model
- **Distributed training**: Scale finetuning across multiple GPUs