Official Guides
Recent
Beating GPT-2 for <$100
Detailed writeup of the nanochat journey to train GPT-2 capability for under $100. Covers the baseline run, which completes in 3.04 hours on 8XH100.
Published: Feb 1, 2026
Jan 7 Miniseries v1
Documentation of the first nanochat miniseries of models. Explains scaling laws, the depth parameter, and how hyperparameters are automatically calculated.
Published: Jan 7, 2026
Capabilities
Counting 'r' in strawberry
Guide on adding new abilities to nanochat. Uses letter counting as an example to demonstrate how to teach models new skills through synthetic data and task design.
Infusing Identity
How to customize your nanochat’s personality through synthetic data generation and mixing that data into the SFT stage. Make your model respond with a unique voice.
Historical
Original nanochat Post
The Oct 13, 2025 post introducing nanochat. Note that some information is now deprecated and the model has significantly improved since then.
Published: Oct 13, 2025
Getting Started
Quick Start Paths
I want to train GPT-2:
- Boot an 8XH100 GPU node
- Run bash runs/speedrun.sh (takes ~3 hours)
- Chat with your model using python -m scripts.chat_web
- See Leaderboard for optimization tips
I want to experiment:
- Start with a d12 model (~5 minute training runs)
- Try different depths: --depth=12, --depth=16, --depth=20
- Monitor WandB for improvements in val_bpb and core_metric
- See Contributing for best practices
I want to add a new capability:
- Read the counting 'r' in strawberry guide
- Create a new Task in the tasks/ directory
- Generate synthetic training data if needed
- Add it to the SFT training mixture
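The synthetic-data step above can be sketched roughly as follows. This is a hedged illustration in the spirit of the counting-'r'-in-strawberry guide: the prompt template, word list, and dict format are assumptions for this sketch, not nanochat's actual Task API (which lives in tasks/).

```python
# Illustrative synthetic data for a letter-counting capability.
# Prompt template, word list, and record format are assumptions,
# not nanochat's actual Task definition.
def make_example(word: str, letter: str) -> dict:
    """One synthetic SFT conversation: question in, exact count out."""
    return {
        "user": f"How many '{letter}' are in the word \"{word}\"?",
        "assistant": str(word.count(letter)),
    }

# A small mixture: every distinct letter of every word becomes an example.
words = ["strawberry", "banana", "mississippi"]
dataset = [make_example(w, ch) for w in words for ch in sorted(set(w))]
print(len(dataset), make_example("strawberry", "r")["assistant"])  # 15 3
```

Because the label is computed with word.count, every example is correct by construction, which is the main appeal of synthetic data for skills like this.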
Development Resources
Core Scripts
- runs/speedrun.sh: Current state-of-the-art GPT-2 training
- runs/scaling_laws.sh: Scaling law experiments
- runs/miniseries.sh: Train a complete miniseries
- runs/runcpu.sh: Small example for CPU/Apple Silicon
Key Documentation
- dev/LOG.md: Development log with detailed explanations of changes
- dev/LEADERBOARD.md: Full leaderboard documentation
- README.md: Project overview and quick start
Example Commands
Train a custom depth model, e.g.:

torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26

Community Channels
GitHub Discussions
Q&A, ideas, and announcements. Best for design discussions and detailed technical questions.
Discord #nanochat
Real-time chat and community support. Great for quick questions and debugging help.
DeepWiki
AI-powered search through the nanochat codebase. Ask questions about specific functions or implementation details.
Research Topics
Active Areas
Pretraining efficiency:
- Training speed optimizations
- Better scaling laws
- Data curriculum strategies
- Mixed precision techniques

Architecture:
- Attention mechanism improvements
- Normalization strategies
- Initialization methods
- Parameter-efficient designs

Evaluation:
- Additional benchmarks
- Faster evaluation methods
- Better metrics for small models

Post-training:
- SFT improvements
- RL training strategies
- Multi-task learning
- Synthetic data generation
Scaling Laws
nanochat uses compute-optimal scaling with a single complexity dial (--depth). All other hyperparameters scale automatically:
- Model width (hidden dimension)
- Number of attention heads
- Learning rate schedules
- Training horizons (token:param ratio)
- Weight decay values
- Batch sizes (as of Run 3)
Use --target-param-data-ratio to overtrain or undertrain.
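The single-dial design can be sketched as below. The constants here (aspect ratio 64, head dimension 128, the 12 * L * d^2 parameter estimate, default ratio 20) are assumptions for illustration, not nanochat's exact formulas.

```python
# Illustrative sketch: derive all hyperparameters from one depth dial.
# Constants (aspect ratio 64, head dim 128, 12*L*d^2 params, ratio 20)
# are assumptions for illustration, not nanochat's exact rules.
def config_from_depth(depth: int, target_param_data_ratio: int = 20) -> dict:
    model_dim = depth * 64                  # width grows linearly with depth
    n_heads = max(1, model_dim // 128)      # keep head dimension fixed
    n_params = 12 * depth * model_dim ** 2  # rough transformer param count
    n_tokens = target_param_data_ratio * n_params  # training horizon
    return {"model_dim": model_dim, "n_heads": n_heads,
            "n_params": n_params, "n_tokens": n_tokens}

cfg = config_from_depth(12)  # a d12 model
print(cfg["model_dim"], cfg["n_heads"], cfg["n_params"])  # 768 6 84934656
```

Raising the ratio above the default trains longer on more tokens per parameter (overtraining); lowering it does the opposite.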
Tips and Tricks
Hardware
8XH100 (recommended):
- ~$3/GPU/hr
- GPT-2 in ~3 hours = ~$72
- Spot instances: ~$20 total

8XA100 (alternative):
- Works fine, slightly slower
- More widely available

Single GPU:
- Automatic gradient accumulation
- 8x longer training time
- Useful for debugging
- Reduce --device-batch-size to 16, 8, 4, 2, or 1 if you hit OOM (must be a power of 2 for clean gradient accumulation)
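The interplay between device batch size and gradient accumulation can be sketched as follows. The function and batch numbers are illustrative assumptions, not nanochat's actual training-loop code.

```python
# Sketch of automatic gradient accumulation when --device-batch-size
# shrinks; illustrative, not nanochat's actual implementation.
def grad_accum_steps(total_batch: int, device_batch: int, world_size: int) -> int:
    """How many micro-steps keep the effective batch size constant."""
    per_step = device_batch * world_size
    assert total_batch % per_step == 0, "use a power-of-2 device batch"
    return total_batch // per_step

print(grad_accum_steps(512, 32, 8))  # 2: 8 GPUs, comfortable memory
print(grad_accum_steps(512, 16, 8))  # 4: halved device batch after OOM
print(grad_accum_steps(512, 32, 1))  # 16: single GPU, 8x more micro-steps
```

This is why powers of 2 matter: each halving of the device batch exactly doubles the accumulation steps, leaving the effective batch (and thus the optimization trajectory) unchanged while trading memory for wall-clock time.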
Training
OOM issues: reduce --device-batch-size (see the Hardware notes above).

Evaluation

Quick CORE eval metrics:
- val_bpb: validation loss (bits per byte)
- core_metric: DCLM CORE score
- train/mfu: model FLOPs utilization
- train/tok_per_sec: training throughput
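For intuition, the first and third metrics can be written as back-of-envelope formulas. This is a hedged sketch: bytes_per_token and peak_flops are assumed inputs, and the formulas (nats-to-bits conversion, the standard ~6N FLOPs-per-token training estimate) are generic, not nanochat's exact code.

```python
# Back-of-envelope definitions of val_bpb and train/mfu; generic
# estimates, not nanochat's exact implementation.
import math

def bits_per_byte(mean_loss_nats: float, bytes_per_token: float) -> float:
    # val_bpb: cross-entropy per token in bits, normalized per input byte,
    # which makes runs with different tokenizers comparable
    return mean_loss_nats / math.log(2) / bytes_per_token

def model_flops_utilization(tok_per_sec: float, n_params: float,
                            peak_flops: float) -> float:
    # train/mfu: achieved training FLOPs (~6*N per token) over hardware peak
    return tok_per_sec * 6 * n_params / peak_flops
```

Normalizing loss per byte rather than per token is the standard trick for tokenizer-independent comparisons; MFU near 0.4-0.5 is typically considered good on modern GPUs.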
Related Projects
nanoGPT
The original minimal GPT implementation focusing only on pretraining. nanochat extends this with tokenization, fine-tuning, evaluation, inference, and chat UI.
modded-nanoGPT
Community-driven optimization of nanoGPT with leaderboard. Inspired nanochat’s leaderboard approach.
Citation
If you find nanochat helpful in your research, please cite the repository.

Contributing Your Own Guide
Written a tutorial or achieved interesting results? Share it with the community:
- Post in GitHub Discussions
- Tag with appropriate labels (guide, tutorial, results, etc.)
- Include:
  - Clear title and motivation
  - Step-by-step instructions
  - Code examples
  - Results and benchmarks
  - Links to resources