
Welcome to nanochat

nanochat is the simplest experimental harness for training large language models (LLMs). It’s designed to run on a single GPU node with minimal, hackable code that covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI.
Train your own GPT-2 capability LLM (which cost ~$43,000 to train in 2019) for only **~$72** (about 3 hours on an 8xH100 GPU node), then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to **$20**.

Key Features

One Complexity Dial

Set a single --depth parameter (number of transformer layers) and all other hyperparameters are calculated automatically in an optimal way.
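As a sketch of what such a dial can look like, the rules below derive width, head count, and a rough parameter count from depth alone. The specific constants (width = 64 × depth, head dimension 128) are illustrative assumptions, not nanochat's actual formulas:

```python
def config_from_depth(depth: int) -> dict:
    """Derive all model hyperparameters from a single depth value.
    The scaling rules here are illustrative assumptions."""
    model_dim = depth * 64                   # width grows linearly with depth
    n_heads = max(1, model_dim // 128)       # fixed head dimension of 128
    approx_params = 12 * depth * model_dim ** 2  # rough transformer param count
    return {
        "depth": depth,
        "model_dim": model_dim,
        "n_heads": n_heads,
        "approx_params": approx_params,
    }

cfg = config_from_depth(20)  # one knob in, a full config out
```

The point is that every downstream quantity is a deterministic function of `depth`, so sweeping one flag sweeps the whole model family.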

Complete Pipeline

Covers tokenization, pretraining, supervised fine-tuning (SFT), reinforcement learning (RL), evaluation, and chat UI - all in one repo.

Minimal & Hackable

Clean, readable PyTorch code with no giant configuration objects or if-then-else monsters. Designed to be maximally forkable.

Compute Optimal

Automatically trains compute-optimal models at various sizes by sweeping the depth parameter - no manual hyperparameter tuning needed.
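For intuition on what "compute optimal" means, a Chinchilla-style heuristic (roughly 20 training tokens per parameter, ~6 FLOPs per parameter per token) ties the token budget to model size. Whether nanochat uses exactly these constants is an assumption here:

```python
def compute_optimal_budget(n_params: float, tokens_per_param: float = 20.0):
    """Chinchilla-style rule of thumb: train on ~20 tokens per parameter."""
    tokens = tokens_per_param * n_params
    flops = 6 * n_params * tokens  # ~6 FLOPs per parameter per token
    return tokens, flops

# A ~560M-parameter model lands near the numbers cited elsewhere on this page:
tokens, flops = compute_optimal_budget(560e6)  # ~1.1e10 tokens, ~3.8e19 FLOPs
```

This is why scaling the depth dial is enough: once the parameter count is fixed, a matching token budget follows mechanically.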

The Complete Pipeline

nanochat guides you through the entire journey of creating a ChatGPT-like model:
1. Tokenization

Train a custom BPE tokenizer with vocab size 32,768 on ~2B characters of data
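The core of BPE training is simple: repeatedly find the most frequent adjacent token pair and merge it into a new token, until the vocabulary reaches the target size (32,768 here). A minimal sketch of one merge step over raw bytes:

```python
from collections import Counter

def most_frequent_pair(ids):
    # count adjacent pairs in a token-id sequence
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` with the new token id
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")
pair = most_frequent_pair(ids)  # (97, 97), i.e. "aa"
ids = merge(ids, pair, 256)     # first new token gets id 256
```

A real trainer just loops this until `vocab_size` tokens exist; the first 256 ids are the raw bytes, so vocab size 32,768 means 32,512 learned merges.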
2. Pretraining

Train the base transformer model on 10B tokens using distributed training across 8 GPUs
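Back-of-the-envelope arithmetic for this stage, assuming the ~1M-token total batch size mentioned in the leaderboard below (the exact value 2**20 is an assumed round number):

```python
total_tokens = 10_000_000_000        # 10B pretraining tokens
batch_tokens = 2 ** 20               # ~1M tokens per optimizer step (assumed)
num_gpus = 8

steps = total_tokens // batch_tokens       # ~9,500 optimizer steps total
tokens_per_gpu = batch_tokens // num_gpus  # 131,072 tokens per GPU per step
```

So the 3-hour run works out to roughly 10,000 optimizer steps, with each of the 8 GPUs processing its shard of the global batch.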
3. Supervised Fine-Tuning (SFT)

Teach the model conversation patterns, tool use, and multiple choice through supervised learning
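Teaching conversation patterns mostly means rendering chat transcripts into a flat token stream with special delimiters, then supervising on the assistant's portions. The delimiter names below are hypothetical placeholders, not nanochat's actual special tokens:

```python
def render_chat(messages):
    """Flatten a chat transcript into one training string.
    <|user|>, <|assistant|>, <|end|> are illustrative placeholders."""
    return "".join(
        f"<|{m['role']}|>{m['content']}<|end|>" for m in messages
    )

text = render_chat([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
])
# "<|user|>What is 2+2?<|end|><|assistant|>4<|end|>"
```

At inference time the model is prompted with everything up to `<|assistant|>` and generates until the end delimiter, which is what makes the web UI feel like a chat.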
4. Reinforcement Learning (RL)

Further align the model through reinforcement learning techniques (optional)
5. Chat & Evaluate

Talk to your model via CLI or web UI, and evaluate it on benchmarks like DCLM CORE, ARC, MMLU, and GSM8K

What You’ll Build

By following the quickstart, you’ll train a model of GPT-2-grade capability (~4e19 training FLOPs) that can:
  • Write stories and poems
  • Answer questions about the world
  • Engage in conversational dialogue
  • Use tools and execute Python code
  • Handle multiple choice questions
The model is comparable to a “kindergartener” in capability - it has basic language understanding and generation abilities, perfect for learning and experimentation.

Time-to-GPT-2 Leaderboard

nanochat maintains a leaderboard for “GPT-2 speedrun” - the wall-clock time required to train a model to GPT-2 grade capability (DCLM CORE score > 0.256525) on an 8xH100 GPU node:
| # | Time | Val BPB | CORE Score | Description |
|---|------|---------|------------|-------------|
| 0 | 168 hours | - | 0.2565 | Original OpenAI GPT-2 (2019) |
| 1 | 3.04h | 0.74833 | 0.2585 | d24 baseline, slightly overtrained |
| 2 | 2.91h | 0.74504 | 0.2578 | d26 slightly undertrained + fp8 |
| 3 | 2.76h | 0.74645 | 0.2602 | Bump total batch size to 1M tokens |
In 2019, training GPT-2 cost approximately **$43,000**. Thanks to 7 years of advances across the stack, we can now achieve the same capability for well below **$100**.
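The headline numbers imply a straightforward rate, assuming on-demand pricing:

```python
total_cost = 72.0   # dollars for the full run
node_hours = 3      # wall-clock hours on one 8xH100 node
num_gpus = 8

per_node_hour = total_cost / node_hours   # $24 per node-hour
per_gpu_hour = per_node_hour / num_gpus   # $3 per H100 GPU-hour
```

Spot-instance pricing at a fraction of that rate is what brings the total closer to $20.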

Ready to Start?

Quickstart

Train your own GPT-2 in 3 hours and start chatting with it

Community & Support

For questions about the repo:
