Philosophy
Accessibility is not just about overall cost, but also about cognitive complexity. nanochat is not an exhaustively configurable LLM “framework”:

❌ No giant configuration objects
❌ No model factories
❌ No if-then-else monsters in the codebase

✅ Single, cohesive, minimal codebase
✅ Readable and hackable
✅ Maximally forkable “strong baseline”
✅ Runs start to end to produce a ChatGPT model you can talk to
Current Focus
The most interesting area of contribution is speeding up the time to GPT-2 (achieving a CORE score above 0.256525). Currently this takes ~3 hours on an 8XH100 node, but we can improve it further by optimizing the pretraining stage. See the Time-to-GPT-2 Leaderboard for details on how to participate.

Contribution Guidelines
Code Quality
- Keep code minimal, readable, and hackable
- Avoid adding abstraction layers or configuration complexity
- Don’t significantly bloat the codebase
- Avoid esoteric or overly specialized optimizations
Principled Improvements
nanochat cares about training an entire miniseries of models, not just targeting a single model size. Your changes must:

✅ Generalize across different model depths (--depth parameter)
✅ Work for the full range of model sizes (not just d24 or d26)
✅ Maintain the “single dial of complexity” philosophy

The depth parameter automatically determines all other hyperparameters (width, heads, learning rate, training horizon, weight decay, etc.) so models come out compute-optimal. Users shouldn’t have to think about these details.
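The “single dial” idea can be sketched as follows. Note that the function name, scaling rules, and constants below are illustrative assumptions, not nanochat’s actual values; only the notion that depth alone drives the rest is from this guide:

```python
# Hypothetical sketch of deriving all hyperparameters from one dial: depth.
# The constants and scaling rules here are assumed for illustration.

def config_from_depth(depth: int) -> dict:
    head_dim = 64                       # fixed per-head dimension (assumed)
    width = depth * head_dim            # width grows linearly with depth (assumed)
    n_heads = width // head_dim
    n_params = 12 * depth * width ** 2  # standard transformer parameter estimate
    tokens = 20 * n_params              # Chinchilla-style compute-optimal horizon
    lr = 0.02 * (width / 768) ** -0.5   # shrink LR as width grows (assumed rule)
    return dict(width=width, n_heads=n_heads, n_params=n_params,
                tokens=tokens, lr=lr)

# A d12 model under these assumed rules:
cfg = config_from_depth(12)
print(cfg["width"], cfg["n_heads"])  # → 768 12
```

The point is that a contributor’s change should keep working when this one dial moves, because every downstream quantity is a function of it.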
Submitting Changes
- Test across depths: Verify your change works for multiple --depth settings (e.g., d12, d16, d20, d24)
- Measure improvements: Show gains in:
  - Training time (wall clock)
  - Validation loss (val_bpb)
  - CORE metric
  - Efficiency (MFU, throughput)
- Document your approach: Explain the reasoning and any tradeoffs
- Create a PR: Include:
  - Clear description of the change
  - Performance improvements with evidence
  - Any AI-assisted code (see policy below)
AI Contribution Policy
Disclosure required. When submitting a PR, please declare:

- Any parts with substantial LLM contribution
- Code you have not written personally
- Code you do not fully understand
Development Workflow
Quick Iteration
For rapid experimentation (~5 minutes per run), train a d12 model and track:

- Validation loss curves
- Training throughput
- Final CORE score
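A quick-iteration run might be launched as sketched below. The torchrun invocation and the scripts.base_train module path are assumptions modeled on common PyTorch conventions; only the --depth flag is taken from this guide:

```python
# Hypothetical sketch: assemble the command for a quick d12 training run
# on an 8-GPU node. Entry-point name and flags are assumed, not verified.
def train_cmd(depth: int, nproc: int = 8) -> list[str]:
    return [
        "torchrun", "--standalone", f"--nproc_per_node={nproc}",
        "-m", "scripts.base_train", "--", f"--depth={depth}",
    ]

print(" ".join(train_cmd(12)))
# → torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=12
```

The same builder covers the multi-depth verification step: call it with d16, d20, and d24 and run each command in turn.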
Scaling Laws
For deeper analysis, run scaling law experiments.

Full Miniseries
To train the complete miniseries, run training across all depths.

Monitoring
Watch these WandB metrics:

- Loss curves: val_bpb vs. step, total_training_time, total_training_flops
- Capability: core_metric (DCLM CORE score)
- Efficiency: train/mfu, train/tok_per_sec, VRAM usage
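The efficiency metrics are related: MFU (model FLOPs utilization) compares achieved training FLOPs, commonly estimated at ~6·N FLOPs per token for an N-parameter transformer, against the hardware peak. A minimal sketch, assuming the ~989 TFLOPS BF16 dense peak of an H100 SXM (the default below is an assumption about the hardware, not a nanochat constant):

```python
# Sketch: estimate MFU from parameter count and measured token throughput.
# Uses the standard ~6*N FLOPs-per-token approximation for transformer training.
def mfu(n_params: float, tok_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float = 989e12) -> float:
    achieved = 6 * n_params * tok_per_sec      # training FLOPs actually done per second
    return achieved / (n_gpus * peak_flops_per_gpu)

# e.g. a 1B-param model at 1M tok/s on an 8-GPU node:
print(round(mfu(1e9, 1e6, 8), 3))
```

So a speedup shows up either as higher tok_per_sec at fixed MFU (less wasted work) or as higher MFU itself (better hardware utilization).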
Areas to Contribute
Pretraining Optimization
- Training efficiency improvements
- Better hyperparameter scaling across depths
- Data loading and preprocessing speedups
- Mixed precision strategies
Model Architecture
- Architecture improvements that generalize
- Attention mechanisms
- Normalization strategies
- Initialization methods
Evaluation
- Additional task implementations
- Improved evaluation metrics
- Faster evaluation methods
Fine-tuning
- SFT improvements
- RL training enhancements
- New capabilities (see counting r in strawberry guide)
Documentation
- Tutorials and guides
- Example notebooks
- Architecture explanations
- Performance optimization tips
What NOT to Contribute
❌ Configuration complexity: Giant YAML configs, complex factories, excessive abstraction
❌ Single-model optimizations: Tweaks that only work for d24 or d26
❌ Framework bloat: Trying to make nanochat support every possible use case
❌ Breaking changes: Modifications that fundamentally alter the simplicity philosophy

Remember: nanochat is intentionally not a framework. It’s a strong baseline that should stay minimal and hackable.

Getting Help
- DeepWiki: Use DeepWiki to ask questions about the repo
- Discussions: GitHub Discussions for design questions and ideas
- Discord: #nanochat channel for real-time help
- Issues: GitHub Issues for bug reports
Community Resources
- Leaderboard - Time-to-GPT-2 competition
- Guides - Tutorials and writeups
- GitHub Discussions - Q&A and announcements
Recognition
Contributors who improve the leaderboard get:

- Credit in the leaderboard table
- Recognition in commit history
- Mention in related writeups and discussions
Acknowledgements
nanochat benefits from the broader community:

- Inspired by nanoGPT and modded-nanoGPT
- Built on datasets from HuggingFace
- Developed with compute from Lambda
- Guidance from Alec Radford
- Repo management by @svlandeg