Train Your Own GPT-2
This guide walks you through training a GPT-2-capability LLM from scratch and talking to it via a ChatGPT-like web UI. The entire process takes approximately 3 hours on an 8xH100 GPU node.

Prerequisites
Hardware Requirements
- 8xH100 GPU Node (Recommended)
- Single GPU
- CPU / Apple Silicon
Optimal setup for the speedrun:
- 8x NVIDIA H100 GPUs (80GB VRAM each)
- Training time: ~2.76 hours
- Cost: ~$20 on spot instances
- Cloud providers: Lambda, AWS, GCP, Azure
The code also runs on 8xA100 GPUs (Ampere), but will take a bit longer.
Software Requirements
- Python 3.10 or higher
- uv package manager (will be installed automatically)
- (Optional) wandb account for experiment tracking
Installation
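Installation is a matter of cloning the repository and launching the speedrun script. A sketch, assuming the standard GitHub location of the repo (the script path follows this guide):

```shell
# Clone nanochat and kick off the full pipeline (repo URL assumed)
git clone https://github.com/karpathy/nanochat.git
cd nanochat
bash runs/speedrun.sh
```

Since the full pipeline takes a few hours, consider running it inside `screen` or `tmux` so it survives a dropped SSH session.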
What Happens During the Speedrun
The `runs/speedrun.sh` script executes the complete LLM training pipeline:
Environment Setup
The script automatically:
- Installs `uv` if not present
- Creates a `.venv` virtual environment
- Installs all dependencies with `uv sync --extra gpu`
- Activates the virtual environment
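Roughly, those steps amount to the following (a sketch, not the script itself; the uv installer URL is the official one, but the exact flags the script uses may differ):

```shell
# Approximate equivalent of the script's environment setup (sketch)
curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv if not present
uv venv .venv                                     # create the virtual environment
uv sync --extra gpu                               # install dependencies incl. GPU extras
source .venv/bin/activate                         # activate it
```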
Tokenizer Training
Downloads ~2B characters of pretraining data (8 shards, ~800MB) and trains a BPE tokenizer:
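Inside the script, this stage looks roughly like the following. The `-n` shard flag appears later in this guide; the tokenizer-training module path is an assumption, so check the repo:

```shell
# Download 8 shards of pretraining data, then train the BPE tokenizer on them
python -m nanochat.dataset -n 8      # -n selects the number of shards (flag per this guide)
python -m scripts.tok_train          # module path is an assumption
```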
Pretraining
Trains a d26 (26-layer) transformer model with FP8 precision:
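A sketch of the launch command for this stage (the module path and torchrun flags are assumptions; `--depth=26` is the documented knob):

```shell
# 8-GPU pretraining launch (module path assumed; --depth documented in this guide)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
```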
The `--depth=26` parameter is the single complexity dial. All other hyperparameters (width, number of heads, learning rate, etc.) are calculated automatically.

Talking to Your Model
Once training is complete, you can chat with your model in two ways:

- Web UI (Recommended)
- CLI
Launch the ChatGPT-like web interface, then visit the URL shown in your terminal. Now you can chat with your LLM just like ChatGPT!
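The launch command is roughly as follows (the module path is an assumption; the default port of 8000 is referenced later in this guide):

```shell
# Serve the ChatGPT-like web UI (module path assumed; default port 8000)
python -m scripts.chat_web
```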
If you’re on a cloud instance (e.g., Lambda), access the UI using the public IP and port, for example:

http://209.20.xxx.xxx:8000/

Expected Results
Your trained model will:

- Achieve a DCLM CORE score > 0.256 (beating GPT-2)
- Have approximately 4e19 FLOPs of capability
- Be able to write stories, poems, and answer basic questions
- Exhibit “kindergartener” level intelligence
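The compute figure can be sanity-checked with the standard C ≈ 6·N·D training-compute estimate (the parameter and token counts below are illustrative assumptions, not measurements of this model):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard training-compute estimate: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

# e.g. a ~0.5B-parameter model trained on ~13B tokens lands near 4e19 FLOPs
print(f"{train_flops(0.5e9, 13e9):.1e}")  # → 3.9e+19
```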
Example Interactions
Try asking your model:

- "Write me a short story about a robot"
- “Why is the sky blue?”
- “What are you? Who created you?”
- “Count the letter ‘r’ in ‘strawberry’”
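The last prompt is a classic tokenizer stress test; the ground-truth answer is easy to verify locally:

```python
# Ground truth for the "count the r's" prompt
print("strawberry".count("r"))  # → 3
```

Small models often get this wrong because they see tokens, not individual characters.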
Timeline Breakdown
On an 8xH100 GPU node:

| Stage | Duration | Description |
|---|---|---|
| Setup | ~2 min | Install dependencies, create venv |
| Tokenizer | ~5 min | Download data, train tokenizer |
| Pretraining | ~2.5 hours | Train 26-layer transformer |
| SFT | ~15 min | Fine-tune for conversation |
| Evaluation | ~5 min | Run benchmarks and generate report |
| Total | ~2.76 hours | Complete pipeline |
Next Steps
- Customize Your Model: Infuse personality and custom abilities into your nanochat
- Research & Experimentation: Train smaller models for rapid iteration and improvement
- File Structure: Understand the codebase organization
- Contributing: Help improve the state of the art in micro models
Troubleshooting
Out of Memory (OOM) errors
Reduce the `--device-batch-size` parameter. The lower the batch size, the longer training will take, but it will fit in less VRAM.

Dataset download is slow
The script downloads ~370 data shards in the background. If your connection is slow:
- Reduce the number of shards: `python -m nanochat.dataset -n 100`
- This will train a less capable model but complete faster
Can't access the web UI
If you’re on a cloud instance:
- Make sure the port (default 8000) is open in your firewall
- Use the public IP address, not localhost: `http://YOUR_PUBLIC_IP:8000/`
- Check that the server is actually running in the terminal output
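If you can't (or don't want to) open the port in the firewall, an SSH tunnel is a common workaround (the username and IP below are placeholders):

```shell
# Forward the remote UI to your local machine, then browse to http://localhost:8000/
ssh -L 8000:localhost:8000 ubuntu@YOUR_PUBLIC_IP
```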
Training seems stuck
Check the log output:
- Look for `train/tok_per_sec` - it should be processing thousands of tokens per second
- If using wandb, monitor the `val_bpb` (validation loss) curve
- The pretraining step takes the longest (~2.5 hours) - this is normal
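You can turn the logged throughput into a rough ETA for the run (the numbers below are purely illustrative placeholders; substitute your own from the logs):

```python
def eta_hours(total_tokens: float, tok_per_sec: float) -> float:
    """Rough remaining-time estimate from logged throughput."""
    return total_tokens / tok_per_sec / 3600

# Example with placeholder numbers - plug in your own logged values
print(f"{eta_hours(10e9, 1e6):.1f} h")
```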