What is nanoGPT?
nanoGPT is a minimal implementation of GPT (Generative Pre-trained Transformer) that you can use to:
- Train from scratch: Build and train your own GPT models on custom datasets
- Finetune pretrained models: Adapt GPT-2 checkpoints to your specific use case
- Reproduce GPT-2: Recreate OpenAI’s GPT-2 (124M) on OpenWebText
- Experiment quickly: Hackable codebase with just ~600 lines of core code
Architecture overview
nanoGPT consists of two main files that contain all the essential functionality:

model.py (~300 lines)
Defines the GPT model architecture with these key components:
- Transformer blocks: Causal self-attention and feed-forward layers
- GPTConfig: Configuration dataclass for model hyperparameters
- GPT class: Main model with training and generation methods
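The configuration dataclass is the natural entry point for modifying the model. A sketch of its shape (field names and GPT-2-sized defaults follow nanoGPT's model.py; check your checkout for the exact values):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum context length
    vocab_size: int = 50304  # GPT-2's 50257 tokens, padded up for efficiency
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding / hidden dimension
    dropout: float = 0.0     # 0.0 for pretraining; try 0.1+ when finetuning
    bias: bool = True        # biases in Linear and LayerNorm, as in GPT-2

# Shrinking the model is just a matter of overriding fields:
tiny = GPTConfig(n_layer=4, n_head=4, n_embd=128, block_size=256)
```

Because every hyperparameter lives in one dataclass, sweeping architectures is a one-line change rather than an edit scattered across the codebase.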
train.py (~300 lines)
A clean training loop with:
- Distributed training: PyTorch DDP support for multi-GPU setups
- Mixed precision: Automatic FP16/BF16 training
- Configuration system: Override defaults via command line or config files
- Checkpoint management: Automatic saving and resuming
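Checkpointing amounts to serializing a dictionary of training state and reading it back on resume. The real train.py uses torch.save on model and optimizer state_dicts; the sketch below is a dependency-free stand-in using pickle, with illustrative key names rather than nanoGPT's exact checkpoint schema:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # train.py persists model/optimizer state plus bookkeeping fields
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

state = {
    "model": {},            # model.state_dict() in the real code
    "optimizer": {},        # optimizer.state_dict()
    "iter_num": 5000,       # step to resume training from
    "best_val_loss": 2.91,  # used to decide whether to overwrite the checkpoint
}

path = os.path.join(tempfile.gettempdir(), "ckpt_demo.pkl")
save_checkpoint(path, state)
resumed = load_checkpoint(path)
```

Keeping the iteration count and best validation loss alongside the weights is what lets a resumed run continue the learning-rate schedule and checkpoint-overwrite logic seamlessly.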
Key features
Simple and readable code
The entire codebase is intentionally minimal. The core training loop (train.py) and model definition (model.py) are each around 300 lines, making the code easy to understand and modify.
PyTorch 2.0 optimized
Uses torch.compile() for significant speed improvements. Flash Attention is automatically enabled when available for efficient self-attention computation.
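Flash Attention is a fast kernel for the same causal attention computation that model.py performs. Conceptually, that computation looks like this (a dependency-free sketch on plain Python lists, not nanoGPT's actual torch code):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_attention(q, k, v):
    """Causal scaled dot-product attention.

    q, k, v: lists of T vectors of dimension d. Position t may only
    attend to positions <= t; that restriction is the causal mask.
    """
    T, d = len(q), len(q[0])
    out = []
    for t in range(T):
        # scores of query t against keys 0..t only (future keys are masked)
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        # weighted sum of the visible value vectors
        out.append([sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(d)])
    return out
```

Flash Attention produces the same result while tiling the computation to avoid materializing the full T×T score matrix, which is where the memory and speed wins come from.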
Flexible configuration
Override any training parameter via command line arguments or configuration files. Config files in the config/ directory provide tested hyperparameters for different use cases.
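The override mechanism is deliberately small: nanoGPT's configurator.py essentially parses --key=value arguments with ast.literal_eval and overwrites matching settings. A simplified, self-contained version of the same idea (apply_overrides is an illustrative name, not nanoGPT's API):

```python
import ast

def apply_overrides(defaults, argv):
    """Apply --key=value command-line overrides to a dict of defaults."""
    config = dict(defaults)
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, val = arg[2:].split("=", 1)
        if key not in config:
            raise KeyError(f"unknown config key: {key}")
        try:
            parsed = ast.literal_eval(val)  # numbers, bools, tuples, ...
        except (ValueError, SyntaxError):
            parsed = val                    # fall back to a plain string
        config[key] = parsed
    return config

defaults = {"batch_size": 12, "learning_rate": 6e-4, "compile": True}
config = apply_overrides(defaults, ["--batch_size=32", "--compile=False"])
```

Rejecting unknown keys catches typos in flag names, and literal_eval keeps types honest, so `--compile=False` yields the boolean False rather than the truthy string "False".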
GPT-2 compatible
Can load OpenAI’s GPT-2 checkpoints (124M, 350M, 774M, 1558M parameters) for finetuning or evaluation.
Performance
On a single 8xA100 40GB node, nanoGPT can reproduce GPT-2 (124M parameters) on OpenWebText in approximately 4 days, achieving a validation loss of ~2.85. For quick experimentation, you can train a character-level model on Shakespeare in just 3 minutes on a single GPU.

Use cases
nanoGPT is designed for:
- Researchers who want to quickly experiment with GPT architectures
- Students learning about transformer language models
- Developers needing to finetune GPT-2 on domain-specific text
- Anyone wanting a hackable, understandable GPT implementation
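The quick character-level Shakespeare experiment mentioned above is possible partly because its tokenizer is trivial: every distinct character becomes one token. A simplified sketch in the spirit of data/shakespeare_char/prepare.py:

```python
text = "To be, or not to be"  # stand-in for the full Shakespeare corpus

# Build the vocabulary: every distinct character gets an integer id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("to be")
roundtrip = decode(ids)
```

A character-level vocabulary is tiny (around 65 symbols for the Shakespeare corpus), which keeps the embedding table small and makes a from-scratch training run feasible in minutes.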
nanoGPT prioritizes simplicity over features. If you need production-ready inference, extensive model variants, or comprehensive tooling, consider using libraries like Hugging Face Transformers.
Next steps
- Installation: Set up your environment and install dependencies
- Quickstart: Train your first GPT model in minutes