The goal of nanochat is to improve the state of the art in micro models that can be worked with end to end on budgets of under $1000.

Philosophy

Accessibility is about overall cost, but also about cognitive complexity. nanochat is not an exhaustively configurable LLM “framework”:

❌ No giant configuration objects
❌ No model factories
❌ No if-then-else monsters in the codebase
✅ Single, cohesive, minimal codebase
✅ Readable and hackable
✅ Maximally forkable “strong baseline”
✅ Runs start to end to produce a ChatGPT model you can talk to

Current Focus

The most interesting area of contribution is speeding up the time-to-GPT-2 (reaching a CORE score above 0.256525). This currently takes ~3 hours on an 8XH100 node, but can be improved further by optimizing the pretraining stage. See the Time-to-GPT-2 Leaderboard for details on how to participate.

Contribution Guidelines

Code Quality

  • Keep code minimal, readable, and hackable
  • Avoid adding abstraction layers or configuration complexity
  • Don’t significantly bloat the codebase
  • Avoid esoteric or overly specialized optimizations

Principled Improvements

nanochat cares about training an entire miniseries of models, not just targeting a single model size. Your changes must:

✅ Generalize across different model depths (--depth parameter)
✅ Work for the full range of model sizes (not just d24 or d26)
✅ Maintain the “single dial of complexity” philosophy
The depth parameter automatically determines all other hyperparameters (width, heads, learning rate, training horizon, weight decay, etc.) so models come out compute-optimal. Users shouldn’t have to think about these details.
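To make the single-dial idea concrete, here is a minimal sketch of how a depth parameter could derive the remaining hyperparameters. The specific ratios, the learning-rate rule, and the names (`derive_config`, `model_dim`) are illustrative assumptions, not nanochat’s actual code:

```python
# Illustrative sketch: deriving all model hyperparameters from a single
# depth dial. The multipliers below are assumptions for illustration.

def derive_config(depth: int) -> dict:
    """Derive width, heads, LR, and training horizon from depth alone."""
    model_dim = depth * 64                 # width scales with depth (assumed ratio)
    n_heads = max(1, model_dim // 128)     # fixed head dimension of 128 (assumed)
    lr = 3e-4 * (12 / depth) ** 0.5        # deeper models get a smaller LR (assumed rule)
    n_params = 12 * depth * model_dim ** 2 # rough transformer parameter count
    tokens = 20 * n_params                 # Chinchilla-style compute-optimal horizon
    return dict(model_dim=model_dim, n_heads=n_heads, lr=lr, tokens=tokens)

print(derive_config(12))
```

The point is that a contributor’s change should slot into a scheme like this, rather than introducing a new knob that each user must tune per model size.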

Submitting Changes

  1. Test across depths: Verify your change works for multiple --depth settings (e.g., d12, d16, d20, d24)
  2. Measure improvements: Show gains in:
    • Training time (wall clock)
    • Validation loss (val_bpb)
    • CORE metric
    • Efficiency (MFU, throughput)
  3. Document your approach: Explain the reasoning and any tradeoffs
  4. Create a PR: Include:
    • Clear description of the change
    • Performance improvements with evidence
    • Any AI-assisted code (see policy below)
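One of the efficiency numbers above, MFU (model FLOPs utilization), can be estimated from throughput alone. A hedged sketch, assuming the standard ~6N FLOPs-per-token approximation for a forward+backward pass and an assumed bf16 peak of ~989 TFLOPS per H100:

```python
# Sketch of estimating MFU from training throughput. The 6N-per-token
# approximation is a standard estimate; the hardware numbers are assumptions.

def mfu(tok_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Achieved training FLOPs as a fraction of hardware peak."""
    flops_per_token = 6 * n_params          # ~6N FLOPs per token (fwd + bwd)
    achieved = tok_per_sec * flops_per_token
    return achieved / peak_flops

# e.g. 1M tok/s on a 1B-param model across 8 H100s (assumed ~989 TFLOPS bf16 each)
print(mfu(1e6, 1e9, 8 * 989e12))
```

Reporting before/after MFU alongside wall-clock time makes it easy to see whether a speedup comes from doing less work or from using the hardware better.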

AI Contribution Policy

Disclosure required. When submitting a PR, please declare:
  • Any parts with substantial LLM contribution
  • Code you have not written personally
  • Code you do not fully understand
This helps maintain code quality and ensures contributors understand what they’re submitting.

Development Workflow

Quick Iteration

For rapid experimentation (~5 minutes per run), train a d12 model:
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12-experiment" \
    --model-tag="d12_experiment" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1
Change something in the code, re-run d12 (or d16), and see if it improves:
  • Validation loss curves
  • Training throughput
  • Final CORE score

Scaling Laws

For deeper analysis, run scaling law experiments:
bash runs/scaling_laws.sh
See Jan 7 miniseries v1 for documentation.

Full Miniseries

To train the complete miniseries across all depths:
bash runs/miniseries.sh

Monitoring

Watch these WandB metrics:
  1. Loss curves: val_bpb vs. step, total_training_time, total_training_flops
  2. Capability: core_metric (DCLM CORE score)
  3. Efficiency: train/mfu, train/tok_per_sec, VRAM usage
Example run analysis
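The val_bpb metric normalizes cross-entropy loss by the byte length of the text, which makes runs comparable across tokenizers. A sketch of the conversion, where the bytes-per-token ratio of 4.0 is an assumed dataset statistic, not a nanochat constant:

```python
import math

# Sketch: converting a token-level cross-entropy loss (in nats) to bits per
# byte (bpb). The bytes-per-token ratio is dataset- and tokenizer-dependent;
# 4.0 here is an assumption for illustration.

def nats_per_token_to_bpb(loss_nats: float, bytes_per_token: float = 4.0) -> float:
    bits_per_token = loss_nats / math.log(2)  # nats -> bits
    return bits_per_token / bytes_per_token   # per token -> per byte

print(nats_per_token_to_bpb(2.77))
```

Because bpb is per byte rather than per token, a change to the tokenizer cannot game the metric the way it could game raw per-token loss.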

Areas to Contribute

Pretraining Optimization

  • Training efficiency improvements
  • Better hyperparameter scaling across depths
  • Data loading and preprocessing speedups
  • Mixed precision strategies

Model Architecture

  • Architecture improvements that generalize
  • Attention mechanisms
  • Normalization strategies
  • Initialization methods

Evaluation

  • Additional task implementations
  • Improved evaluation metrics
  • Faster evaluation methods

Fine-tuning

Documentation

  • Tutorials and guides
  • Example notebooks
  • Architecture explanations
  • Performance optimization tips

What NOT to Contribute

❌ Configuration complexity: Giant YAML configs, complex factories, excessive abstraction
❌ Single-model optimizations: Tweaks that only work for d24 or d26
❌ Framework bloat: Trying to make nanochat support every possible use case
❌ Breaking changes: Modifications that fundamentally alter the simplicity philosophy

Remember: nanochat is intentionally not a framework. It’s a strong baseline that should stay minimal and hackable.

Getting Help

Community Resources

Recognition

Contributors who improve the leaderboard get:
  • Credit in the leaderboard table
  • Recognition in commit history
  • Mention in related writeups and discussions

Acknowledgements

nanochat benefits from the broader community. Thank you to everyone who contributes to making powerful language models accessible!
