Prepare the dataset
First, download and tokenize the Shakespeare dataset by running `python data/shakespeare_char/prepare.py`. This writes `train.bin` and `val.bin` files containing the character-level tokenized text.
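Character-level tokenization means every distinct character becomes one token ID. A minimal, self-contained sketch of the idea (variable names here are illustrative, not the prepare script's actual code):

```python
import numpy as np

# Build a character vocabulary from the text, map characters to integer
# IDs, and store them compactly as uint16 (as the .bin files do).
text = "To be, or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

ids = np.array([stoi[c] for c in text], dtype=np.uint16)
decoded = "".join(itos[int(i)] for i in ids)
assert decoded == text  # encoding round-trips losslessly
```

The real vocabulary is built over the full Shakespeare text, so it contains every character the model can ever emit.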
Training configurations
- GPU training
- CPU training
- Apple Silicon
Train a baby GPT using the default configuration:
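Assuming the standard nanoGPT layout, the default GPU run and the smaller CPU and Apple Silicon variants look like the following (flag values follow the upstream README; treat them as starting points, not tuned settings):

```shell
# Default run (GPU):
python train.py config/train_shakespeare_char.py

# CPU: shrink the model and context so a run finishes in reasonable time
python train.py config/train_shakespeare_char.py --device=cpu --compile=False \
  --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 \
  --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 \
  --lr_decay_iters=2000 --dropout=0.0

# Apple Silicon: use the MPS backend instead
python train.py config/train_shakespeare_char.py --device=mps
```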
Model architecture
The configuration in `config/train_shakespeare_char.py` defines a small Transformer: 6 layers, 6 attention heads, and 384 feature channels.
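The relevant lines of the config look roughly like this (paraphrased from `config/train_shakespeare_char.py`; check the file itself for the authoritative values):

```python
# Baby GPT model dimensions
n_layer = 6    # transformer blocks
n_head = 6     # attention heads per block
n_embd = 384   # embedding / channel width (must be divisible by n_head)
dropout = 0.2  # a little regularization helps on a dataset this small
```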
Expected results
- Training time: ~3 minutes on an A100 GPU
- Best validation loss: 1.4697
- Output directory: out-shakespeare-char
Configuration parameters
Key parameters from `config/train_shakespeare_char.py`:
| Parameter | Value | Description |
|---|---|---|
| out_dir | 'out-shakespeare-char' | Checkpoint directory |
| eval_interval | 250 | Steps between evaluations |
| eval_iters | 200 | Batches used per evaluation |
| gradient_accumulation_steps | 1 | No gradient accumulation |
| batch_size | 64 | Batch size per iteration |
| block_size | 256 | Context length in characters |
| learning_rate | 1e-3 | Higher LR for baby networks |
| max_iters | 5000 | Total training iterations |
| warmup_iters | 100 | Linear warmup steps |
| beta2 | 0.99 | Adam beta2 (higher due to small batch) |
Sample from the model
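Assuming the standard nanoGPT layout, sampling is a single script invocation pointed at the training run's checkpoint directory:

```shell
python sample.py --out_dir=out-shakespeare-char
```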
Once training completes, this generates text samples from the saved checkpoint.
Example output
After 3 minutes of training on a GPU:
Advanced configuration
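Every parameter in the table above can also be overridden from the command line (e.g. `--max_iters=10000`). A minimal sketch of how such an override mechanism can work, with illustrative names (nanoGPT's actual `configurator.py` takes a different, exec-based approach):

```python
import ast

def apply_overrides(config, argv):
    """Apply --key=value overrides to a dict of config defaults."""
    for arg in argv:
        assert arg.startswith("--") and "=" in arg, f"bad override: {arg}"
        key, raw = arg[2:].split("=", 1)
        assert key in config, f"unknown config key: {key}"
        try:
            value = ast.literal_eval(raw)  # numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw                    # fall back to a plain string
        config[key] = value
    return config

defaults = {"max_iters": 5000, "learning_rate": 1e-3,
            "out_dir": "out-shakespeare-char"}
cfg = apply_overrides(defaults, ["--max_iters=10000", "--learning_rate=5e-4"])
# cfg["max_iters"] is now 10000 and cfg["learning_rate"] is 5e-4
```

Rejecting unknown keys up front catches typos in flag names before a long training run starts.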
Next steps
- Reproduce GPT-2: train a 124M-parameter model on OpenWebText
- Finetuning: finetune pretrained GPT-2 models on custom data