Overview
You’ll learn how to:

- Prepare the Shakespeare dataset for training
- Train a small GPT model from scratch
- Generate text samples from your trained model
- Adjust hyperparameters for different hardware
This quickstart uses character-level modeling (not BPE tokens) for simplicity. The entire dataset is just 1MB and training takes only a few minutes.
Prepare the dataset
First, download and prepare the Shakespeare dataset by running the preparation script: `python data/shakespeare_char/prepare.py`.

The script downloads the tiny Shakespeare dataset and converts it to binary format. Specifically, it:
- Downloads the complete works of Shakespeare (~1MB text file)
- Creates a character-to-integer mapping
- Splits the data into training (90%) and validation (10%) sets
- Saves `train.bin` and `val.bin` in `data/shakespeare_char/`
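The steps above can be sketched in plain Python. This is a simplified, illustrative stand-in for `prepare.py`: the real script uses NumPy, and it also saves vocabulary metadata, so treat the function and file names here as hypothetical.

```python
# Minimal sketch of character-level data preparation
# (illustrative stand-in for data/shakespeare_char/prepare.py).
from array import array

def prepare(text, train_frac=0.9):
    # Build the character-to-integer mapping from the unique characters.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    # Encode the full text as integer ids.
    ids = [stoi[ch] for ch in text]
    # Split into training (90%) and validation (10%) sets.
    n = int(train_frac * len(ids))
    return stoi, ids[:n], ids[n:]

def save_bin(ids, path):
    # Store ids as unsigned 16-bit integers, like the .bin files.
    with open(path, "wb") as f:
        array("H", ids).tofile(f)

# Tiny example in place of the ~1MB Shakespeare file:
stoi, train_ids, val_ids = prepare("to be or not to be")
```

At character level the vocabulary is just the set of distinct characters, which is why no BPE tokenizer is needed.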
Train the model
Now you can train a GPT model on this data. The training approach depends on your hardware.

With GPU

If you have a GPU, you can train a small but capable model using the provided config file: `python train.py config/train_shakespeare_char.py`. Model checkpoints are saved to the `out-shakespeare-char/` directory.

CPU / Low-end hardware

Without a GPU, add `--device=cpu --compile=False` and scale the model down with overrides such as `--n_layer=4 --n_head=4 --n_embd=128 --block_size=64 --batch_size=12 --max_iters=2000`.
Model architecture

The config file `config/train_shakespeare_char.py` defines a "baby GPT".

Training progress
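For reference, the key hyperparameters in that config look roughly like the following (abridged, reproduced from memory; the file in the repository is authoritative):

```python
# Abridged sketch of config/train_shakespeare_char.py.
# Values are approximate recollections, not the authoritative config.
out_dir = 'out-shakespeare-char'
dataset = 'shakespeare_char'

# A "baby GPT": small enough to train in minutes on one GPU.
n_layer = 6       # transformer blocks
n_head = 6        # attention heads per block
n_embd = 384      # embedding / channel dimension
block_size = 256  # context length in characters
dropout = 0.2     # regularization for the tiny dataset

batch_size = 64
max_iters = 5000
learning_rate = 1e-3
```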
On an A100 GPU, this takes about 3 minutes and reaches a validation loss of around 1.47.

Generate text samples
Once training completes, generate Shakespeare-style text from your model: `python sample.py --out_dir=out-shakespeare-char`. If you trained on CPU, add the `--device=cpu` flag.
Example output
With the GPU-trained model (validation loss 1.47), the samples read as recognizably Shakespeare-flavored dialogue, though not always coherent.

Customize generation
You can control the generation process with additional parameters:

- `--start`: The prompt to start generation. Can also load a prompt from a file with `FILE:prompt.txt`
- `--num_samples`: Number of independent samples to generate
- `--max_new_tokens`: Maximum number of tokens to generate per sample
- `--temperature`: Sampling temperature. Lower values (0.6-0.8) are more conservative; higher values (1.0+) more creative
- `--top_k`: Only sample from the top k most likely tokens at each step
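To make temperature and top-k concrete, here is a small self-contained sketch of how a sampler applies them to a vector of logits. This is illustrative only; the actual `sample.py` operates on PyTorch tensors, and the function below is hypothetical.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    # Scale logits by temperature: <1 sharpens, >1 flattens the distribution.
    scaled = [l / temperature for l in logits]
    # Keep only the top-k logits; mask the rest to -inf.
    if top_k is not None:
        top_k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= cutoff else float("-inf") for l in scaled]
    # Softmax over the (possibly masked) logits.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index from the categorical distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With a very low temperature and `top_k=1`, this reduces to greedy decoding (always the most likely token); raising the temperature spreads probability mass onto less likely tokens.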
Understanding the training loop
Let’s examine what happens during training. The core training loop in `train.py` repeatedly samples a random batch of data, runs a forward pass to compute the loss, backpropagates, and steps the optimizer, periodically estimating train/validation loss and saving a checkpoint when the validation loss improves.
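The actual loop in `train.py` is built on PyTorch (with mixed precision, `torch.compile`, and optional DDP). The dependency-free sketch below mirrors only its control flow (batch sampling, loss and gradient, optimizer step, periodic evaluation); a hypothetical one-parameter least-squares model stands in for the GPT forward/backward pass.

```python
import random

# Toy stand-in for the model: fit w in y = w * x by least squares.
data = [(x, 3.0 * x) for x in range(1, 11)]  # "dataset"; true w = 3
w = 0.0                                      # model "weights"

def get_batch(rng, batch_size=4):
    # Analogue of train.py's get_batch: a random slice of the data.
    return rng.sample(data, batch_size)

def loss_and_grad(w, batch):
    # Mean squared error and its gradient w.r.t. w
    # (stands in for the model's forward and backward pass).
    n = len(batch)
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    grad = sum(2 * (w * x - y) * x for x, y in batch) / n
    return loss, grad

rng = random.Random(42)
lr, max_iters, eval_interval = 0.01, 200, 50
best_val = float("inf")

for it in range(max_iters):
    # Periodically estimate loss and "checkpoint" on improvement,
    # mirroring train.py's eval_interval / best_val_loss logic.
    if it % eval_interval == 0:
        val_loss, _ = loss_and_grad(w, data)
        if val_loss < best_val:
            best_val = val_loss  # train.py would save a checkpoint here
    batch = get_batch(rng)
    loss, grad = loss_and_grad(w, batch)
    w -= lr * grad               # optimizer step (plain SGD)
```

The real loop adds gradient accumulation, learning-rate decay, and gradient clipping around the same skeleton.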
Next steps
Finetune GPT-2
Learn how to finetune pretrained GPT-2 models on your own data for better results
Training configuration
Explore all available hyperparameters and configuration options
Character-level training
Deep dive into character-level model training and configuration
Distributed training
Scale up to multi-GPU training with PyTorch DDP
Common issues
Training is very slow

- Ensure you’re using a GPU with `--device=cuda` (or `--device=mps` on Mac)
- Verify PyTorch 2.0+ is installed to enable `torch.compile()`
- Check that compilation is enabled (don’t use `--compile=False` on GPU)
Out of memory errors

Reduce memory usage by lowering the batch size (`--batch_size`), shortening the context (`--block_size`), or shrinking the model (`--n_layer`, `--n_head`, `--n_embd`).
Loss not decreasing

- Verify the data preparation completed successfully
- Check that `train.bin` and `val.bin` exist in `data/shakespeare_char/`
- Try increasing the learning rate or reducing dropout
Generated text is nonsense

- The model may need more training iterations
- Try lowering the temperature: `--temperature=0.7`
- Check that you’re loading the correct checkpoint with `--out_dir`