The fastest way to get started with nanoGPT is to train a character-level model on the works of Shakespeare. This small-scale training run completes in about 3 minutes on a GPU.

Prepare the dataset

First, download and tokenize the Shakespeare dataset:
python data/shakespeare_char/prepare.py
This creates train.bin and val.bin files containing the character-level tokenized text.
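The steps the script performs can be sketched in a few lines of plain Python: build a character vocabulary, map text to integer ids, and split 90/10 into train and validation sets. (This is an illustrative sketch; the real prepare.py downloads the Shakespeare text and writes numpy uint16 binaries.)

```python
# Hedged sketch of what prepare.py does; the tiny string below is a
# stand-in for the full downloaded Shakespeare text.
text = "To be, or not to be, that is the question."

chars = sorted(set(text))                 # the character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)
n = len(ids)
train_ids, val_ids = ids[: int(n * 0.9)], ids[int(n * 0.9):]
assert decode(ids) == text                # round-trip check
```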

Training configurations

Train a baby GPT using the default configuration:
python train.py config/train_shakespeare_char.py

Model architecture

The configuration in config/train_shakespeare_char.py defines a small Transformer:
# Model parameters
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

# Training parameters
batch_size = 64
block_size = 256  # context of up to 256 previous characters
learning_rate = 1e-3
max_iters = 5000
This configuration trains a 6-layer Transformer with 6 attention heads and 384 feature channels.
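As a back-of-the-envelope sanity check, the model's size can be estimated from these numbers alone. The sketch below counts only weight matrices (biases and LayerNorm parameters are ignored, and the output head is assumed weight-tied to the token embedding, as in nanoGPT); the result lands close to the ~10.65M parameters the repo reports for this config.

```python
# Rough parameter count for the baby GPT above.
# vocab_size=65 is the Shakespeare character set (an assumption of this
# sketch; prepare.py derives it from the data).
n_layer, n_head, n_embd = 6, 6, 384
block_size, vocab_size = 256, 65

attn = 4 * n_embd * n_embd            # Q, K, V and output projections
mlp = 2 * n_embd * (4 * n_embd)       # MLP with a 4x hidden expansion
blocks = n_layer * (attn + mlp)       # i.e. 12 * n_layer * n_embd^2
embeddings = (vocab_size + block_size) * n_embd
total = blocks + embeddings
print(f"~{total / 1e6:.1f}M parameters")  # prints "~10.7M parameters"
```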

Expected results

  • Training time: ~3 minutes on A100 GPU
  • Best validation loss: 1.4697
  • Output directory: out-shakespeare-char

Configuration parameters

Key parameters from config/train_shakespeare_char.py:
Parameter                    Value                   Description
out_dir                      'out-shakespeare-char'  Checkpoint directory
eval_interval                250                     Steps between evaluations
eval_iters                   200                     Batches to use for evaluation
gradient_accumulation_steps  1                       No gradient accumulation
batch_size                   64                      Batch size per iteration
block_size                   256                     Context length in characters
learning_rate                1e-3                    Higher LR for baby networks
max_iters                    5000                    Total training iterations
warmup_iters                 100                     Linear warmup steps
beta2                        0.99                    Adam beta2 (higher due to small batch)
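
The learning_rate and warmup_iters values above combine in nanoGPT's schedule: linear warmup followed by cosine decay. A sketch of that schedule is below; min_lr is an assumption here (learning_rate / 10, the repo's usual convention), and lr_decay_iters is set to mirror max_iters.

```python
import math

# Sketch of nanoGPT's learning-rate schedule: linear warmup for
# warmup_iters steps, then cosine decay down to min_lr.
learning_rate = 1e-3
warmup_iters = 100
lr_decay_iters = 5000            # typically set equal to max_iters
min_lr = learning_rate / 10      # assumption: repo convention

def get_lr(it):
    if it < warmup_iters:                       # linear warmup
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:                     # floor after decay
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```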

Sample from the model

After training completes, generate text samples:
python sample.py --out_dir=out-shakespeare-char
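
Under the hood, the script draws one token at a time from the model's next-character distribution. A minimal, framework-free sketch of the per-step sampling rule (temperature scaling plus top-k truncation; the default values shown are assumptions of this sketch, not the script's documented flags):

```python
import math, random

def sample_next(logits, temperature=0.8, top_k=200, rng=random):
    """Draw one token index from temperature-scaled, top-k-truncated logits."""
    # keep only the k largest logits, mask the rest to -inf
    k = min(top_k, len(logits))
    cutoff = sorted(logits, reverse=True)[k - 1]
    scaled = [l / temperature if l >= cutoff else float("-inf") for l in logits]
    # softmax over the surviving logits
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sample an index from the resulting distribution
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With top_k=1 this reduces to greedy decoding; raising the temperature flattens the distribution and makes samples more varied.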

Example output

After about 3 minutes of training on a GPU:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

DUKE VINCENTIO:
I thank your eyes against it.

DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?
Character-level models produce lower quality text than BPE-tokenized models. For better results, consider finetuning a pretrained GPT-2 model on this dataset.

Advanced configuration

1. Adjust model size

Modify n_layer, n_head, and n_embd in the config file or via the command line (keep n_embd divisible by n_head):
python train.py config/train_shakespeare_char.py --n_layer=8 --n_head=8 --n_embd=512
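One constraint to respect when scaling: n_embd must divide evenly among the attention heads, since each head receives n_embd // n_head feature channels (nanoGPT asserts this at model construction). A quick check, using n_embd=512 with n_head=8 as an example pairing:

```python
# Each attention head gets n_embd // n_head channels, so n_embd must be
# a multiple of n_head. 512 pairs with 8 heads; it would not pair with 6.
n_head, n_embd = 8, 512
assert n_embd % n_head == 0
head_dim = n_embd // n_head
print(head_dim)  # prints 64
```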
2. Extend training

Increase max_iters and lr_decay_iters for longer training:
python train.py config/train_shakespeare_char.py --max_iters=10000 --lr_decay_iters=10000
3. Enable logging

Track training progress with Weights & Biases:
python train.py config/train_shakespeare_char.py --wandb_log=True

Next steps

  • Reproduce GPT-2: train a 124M parameter model on OpenWebText
  • Finetuning: finetune pretrained GPT-2 models on custom data
