
Train Your Own GPT-2

This guide walks you through training a GPT-2-grade LLM from scratch and talking to it via a ChatGPT-like web UI. The entire process takes approximately 3 hours on an 8xH100 GPU node.

Prerequisites

Hardware Requirements

  • An 8xH100 GPU node is the reference setup for this guide; other GPU configurations work if you lower --device-batch-size (see Troubleshooting)
Software Requirements

  • Python 3.10 or higher
  • uv package manager (will be installed automatically)
  • (Optional) wandb account for experiment tracking

Installation

1. Clone the repository

git clone https://github.com/karpathy/nanochat.git
cd nanochat

2. Run the speedrun script

The entire pipeline is contained in a single script:
bash runs/speedrun.sh
This will take ~3 hours to complete. We recommend running it in a screen session:
screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
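While the run is in flight, you can detach from the session (Ctrl-A, then D) and check on it with standard screen and tail commands:

```shell
# Reattach to the running session
screen -r speedrun

# Or follow the log from another terminal without attaching
tail -f runs/speedrun.log
```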
To enable wandb logging:
# First login to wandb
wandb login

# Then run with WANDB_RUN environment variable
WANDB_RUN=speedrun screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh

What Happens During the Speedrun

The runs/speedrun.sh script executes the complete LLM training pipeline:
1. Environment Setup

The script automatically:
  • Installs uv if not present
  • Creates a .venv virtual environment
  • Installs all dependencies with uv sync --extra gpu
  • Activates the virtual environment
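If you prefer to set up the environment by hand, the equivalent steps look roughly like this (the exact invocations may differ slightly from what the script does; the uv installer URL is the official one from Astral):

```shell
# Install uv if it is not already present
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create .venv and install all dependencies, including the GPU extras
uv sync --extra gpu

# Activate the virtual environment
source .venv/bin/activate
```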

2. Tokenizer Training

Downloads ~2B characters of pretraining data (8 shards, ~800MB) and trains a BPE tokenizer:
# Download initial data
python -m nanochat.dataset -n 8

# Download full dataset in background (370 shards for ~10B tokens)
python -m nanochat.dataset -n 370 &

# Train tokenizer with vocab size 32,768
python -m scripts.tok_train

# Evaluate tokenizer compression ratio
python -m scripts.tok_eval

3. Pretraining

Trains a d26 (26-layer) transformer model with FP8 precision:
# Train base model (this is the longest step - ~2.5 hours)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=26 \
  --target-param-data-ratio=8.25 \
  --device-batch-size=16 \
  --fp8 \
  --run=$WANDB_RUN

# Evaluate: CORE metric, bits per byte, and generate samples
torchrun --standalone --nproc_per_node=8 -m scripts.base_eval -- \
  --device-batch-size=16
The --depth=26 parameter is the single complexity dial. All other hyperparameters (width, number of heads, learning rate, etc.) are calculated automatically.
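As a rough illustration of what a single-dial scaling rule looks like, here is a sketch in Python. The aspect_ratio and head_dim constants below are illustrative assumptions, not nanochat's actual values; the real formulas live in the nanochat source:

```python
def derived_config(depth: int) -> dict:
    # Hypothetical scaling rule: model width grows linearly with depth.
    # These constants are illustrative, not nanochat's real hyperparameters.
    aspect_ratio = 64   # model width gained per layer of depth
    head_dim = 64       # dimension of each attention head
    model_dim = depth * aspect_ratio
    return {
        "depth": depth,
        "model_dim": model_dim,
        "num_heads": model_dim // head_dim,
    }

print(derived_config(26))  # -> {'depth': 26, 'model_dim': 1664, 'num_heads': 26}
```

The point is that one integer determines everything downstream, so comparing runs only ever means comparing one number.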

4. Supervised Fine-Tuning (SFT)

Teaches the model conversation patterns and tool use:
# Download synthetic identity conversations (2.3MB)
curl -L -o ~/.cache/nanochat/identity_conversations.jsonl \
  https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

# Run SFT
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
  --device-batch-size=16 \
  --run=$WANDB_RUN

# Evaluate the chat model
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft

5. Generate Report

Creates a comprehensive markdown report with all results:
python -m nanochat.report generate

Talking to Your Model

Once training is complete, you can chat with your model in two ways:
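In the upstream nanochat repo these two options correspond to a web UI server and a terminal chat script; verify the exact entry points against the scripts/ directory of your checkout:

```shell
# Option 1: ChatGPT-like web UI (then open http://localhost:8000/)
python -m scripts.chat_web

# Option 2: chat directly in the terminal
python -m scripts.chat_cli
```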

Expected Results

Your trained model will:
  • Achieve a DCLM CORE score > 0.256 (beating GPT-2)
  • Represent approximately 4e19 FLOPs of training compute
  • Be able to write stories, poems, and answer basic questions
  • Exhibit “kindergartener” level intelligence

Example Interactions

Try asking your model:
  • “Write me a short story about a robot”
  • “Why is the sky blue?”
  • “What are you? Who created you?”
  • “Count the letter ‘r’ in ‘strawberry’”
This is a small model trained from scratch, so expect some hallucinations and mistakes. It’s perfect for learning and experimentation!
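The last prompt is a classic stress test: letter-counting is hard for LLMs because they see tokens, not characters. For reference, the correct answer is easy to verify:

```python
# The model sees "strawberry" as tokens, not letters, so it often miscounts.
print("strawberry".count("r"))  # -> 3
```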

Timeline Breakdown

On an 8xH100 GPU node:
| Stage       | Duration    | Description                        |
|-------------|-------------|------------------------------------|
| Setup       | ~2 min      | Install dependencies, create venv  |
| Tokenizer   | ~5 min      | Download data, train tokenizer     |
| Pretraining | ~2.5 hours  | Train 26-layer transformer         |
| SFT         | ~15 min     | Fine-tune for conversation         |
| Evaluation  | ~5 min      | Run benchmarks and generate report |
| Total       | ~2.76 hours | Complete pipeline                  |

Next Steps

Customize Your Model

Infuse personality and custom abilities into your nanochat

Research & Experimentation

Train smaller models for rapid iteration and improvement

File Structure

Understand the codebase organization

Contributing

Help improve the state of the art in micro models

Troubleshooting

If you run out of GPU memory, reduce the --device-batch-size parameter:
# In runs/speedrun.sh, change from default 32 to lower values
--device-batch-size=16  # or 8, 4, 2, 1
A lower batch size fits in less VRAM but makes training take longer.
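The reason this trades speed for memory is gradient accumulation: the total batch per optimizer step stays fixed, so a smaller per-device batch just means more forward/backward passes per step. A minimal sketch (the batch numbers below are illustrative, not the script's actual values):

```python
def grad_accum_steps(total_batch: int, device_batch: int, num_gpus: int) -> int:
    # Each optimizer step must process total_batch examples; each GPU
    # contributes device_batch examples per forward/backward pass.
    per_pass = device_batch * num_gpus
    assert total_batch % per_pass == 0, "batch sizes must divide evenly"
    return total_batch // per_pass

# Illustrative numbers: halving the device batch doubles the passes per step.
print(grad_accum_steps(512, 16, 8))  # -> 4
print(grad_accum_steps(512, 8, 8))   # -> 8
```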
The script downloads ~370 data shards in the background. If your connection is slow:
  • Reduce the number of shards: python -m nanochat.dataset -n 100
  • This will train a less capable model but complete faster
If you're on a cloud instance and can't reach the web UI from your browser:
  1. Make sure the port (default 8000) is open in your firewall
  2. Use the public IP address, not localhost: http://YOUR_PUBLIC_IP:8000/
  3. Check that the server is actually running in the terminal output
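If opening the firewall port isn't an option, an SSH tunnel works too (replace user and YOUR_PUBLIC_IP with your own values):

```shell
# Forward local port 8000 to the instance, then browse http://localhost:8000/
ssh -L 8000:localhost:8000 user@YOUR_PUBLIC_IP
```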
If training seems slow or stalled, check the log output:
  • Look for train/tok_per_sec - should be processing thousands of tokens per second
  • If using wandb, monitor the val_bpb (validation loss) curve
  • The pretraining step takes the longest (~2.5 hours) - this is normal

Community Support

Need help? Reach out via issues and discussions on the nanochat GitHub repository.
