Train Your Own GPT-2
This guide walks you through training a GPT-2-capability LLM from scratch and talking to it via a ChatGPT-like web UI. The entire process takes approximately 3 hours on an 8xH100 GPU node.

Prerequisites
Hardware Requirements
- 8xH100 GPU Node (Recommended)
- Single GPU
- CPU / Apple Silicon
Optimal setup for the speedrun:
- 8x NVIDIA H100 GPUs (80GB VRAM each)
- Training time: ~2.76 hours
- Cost: ~$20 on spot instances
- Cloud providers: Lambda, AWS, GCP, Azure
The code also runs on 8xA100 GPUs (Ampere), but will take a bit longer.
Software Requirements
- Python 3.10 or higher
- uv package manager (will be installed automatically)
- (Optional) wandb account for experiment tracking
Installation
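Installation is a matter of cloning the repository and launching the speedrun script. A sketch, assuming the standard GitHub location of the repo (the script path follows this guide):

```shell
# Clone nanochat and kick off the full pipeline (repo URL assumed)
git clone https://github.com/karpathy/nanochat.git
cd nanochat
bash runs/speedrun.sh
```

Since the full pipeline takes a few hours, consider running it inside `screen` or `tmux` so it survives a dropped SSH session.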
What Happens During the Speedrun
The `runs/speedrun.sh` script executes the complete LLM training pipeline:
Environment Setup
The script automatically:
- Installs `uv` if not present
- Creates a `.venv` virtual environment
- Installs all dependencies with `uv sync --extra gpu`
- Activates the virtual environment
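Roughly, those steps amount to the following (a sketch, not the script itself; the uv installer URL is the official one, but the exact flags the script uses may differ):

```shell
# Approximate equivalent of the script's environment setup (sketch)
curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv if not present
uv venv .venv                                     # create the virtual environment
uv sync --extra gpu                               # install dependencies incl. GPU extras
source .venv/bin/activate                         # activate it
```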
Tokenizer Training
Downloads ~2B characters of pretraining data (8 shards, ~800MB) and trains a BPE tokenizer:
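Inside the script, this stage looks roughly like the following. The `-n` shard flag appears later in this guide; the tokenizer-training module path is an assumption, so check the repo:

```shell
# Download 8 shards of pretraining data, then train the BPE tokenizer on them
python -m nanochat.dataset -n 8      # -n selects the number of shards (flag per this guide)
python -m scripts.tok_train          # module path is an assumption
```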
Pretraining
Trains a d26 (26-layer) transformer model with FP8 precision:
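A sketch of the launch command for this stage (the module path and torchrun flags are assumptions; `--depth=26` is the documented knob):

```shell
# 8-GPU pretraining launch (module path assumed; --depth documented in this guide)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
```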
The `--depth=26` parameter is the single complexity dial. All other hyperparameters (width, number of heads, learning rate, etc.) are calculated automatically.

Talking to Your Model
Once training is complete, you can chat with your model in two ways:

- Web UI (Recommended)
- CLI
Launch the ChatGPT-like web interface, then visit the URL shown in your terminal. Now you can chat with your LLM just like ChatGPT!
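The launch command is roughly as follows (the module path is an assumption; the default port of 8000 is referenced later in this guide):

```shell
# Serve the ChatGPT-like web UI (module path assumed; default port 8000)
python -m scripts.chat_web
```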
If you’re on a cloud instance (e.g., Lambda), access the UI using the public IP and port, for example:

http://209.20.xxx.xxx:8000/

Expected Results
Your trained model will:

- Achieve a DCLM CORE score > 0.256 (beating GPT-2)
- Have approximately 4e19 FLOPs of capability
- Be able to write stories, poems, and answer basic questions
- Exhibit “kindergartener” level intelligence
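The compute figure can be sanity-checked with the standard C ≈ 6·N·D training-compute estimate (the parameter and token counts below are illustrative assumptions, not measurements of this model):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard training-compute estimate: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

# e.g. a ~0.5B-parameter model trained on ~13B tokens lands near 4e19 FLOPs
print(f"{train_flops(0.5e9, 13e9):.1e}")  # → 3.9e+19
```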
Example Interactions
Try asking your model:

- "Write me a short story about a robot"
- “Why is the sky blue?”
- “What are you? Who created you?”
- “Count the letter ‘r’ in ‘strawberry’”
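The last prompt is a classic tokenizer stress test; the ground-truth answer is easy to verify locally:

```python
# Ground truth for the "count the r's" prompt
print("strawberry".count("r"))  # → 3
```

Small models often get this wrong because they see tokens, not individual characters.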
Timeline Breakdown
On an 8xH100 GPU node:

| Stage | Duration | Description |
|---|---|---|
| Setup | ~2 min | Install dependencies, create venv |
| Tokenizer | ~5 min | Download data, train tokenizer |
| Pretraining | ~2.5 hours | Train 26-layer transformer |
| SFT | ~15 min | Fine-tune for conversation |
| Evaluation | ~5 min | Run benchmarks and generate report |
| Total | ~2.76 hours | Complete pipeline |
Next Steps
- Customize Your Model: Infuse personality and custom abilities into your nanochat
- Research & Experimentation: Train smaller models for rapid iteration and improvement
- File Structure: Understand the codebase organization
- Contributing: Help improve the state of the art in micro models
Troubleshooting
Out of Memory (OOM) errors
Reduce the `--device-batch-size` parameter. The lower the batch size, the longer training will take, but it will fit in less VRAM.

Dataset download is slow
The script downloads ~370 data shards in the background. If your connection is slow:
- Reduce the number of shards: `python -m nanochat.dataset -n 100`
- This will train a less capable model but complete faster
Can't access the web UI
If you’re on a cloud instance:
- Make sure the port (default 8000) is open in your firewall
- Use the public IP address, not localhost: `http://YOUR_PUBLIC_IP:8000/`
- Check that the server is actually running in the terminal output
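If you can't (or don't want to) open the port in the firewall, an SSH tunnel is a common workaround (the username and IP below are placeholders):

```shell
# Forward the remote UI to your local machine, then browse to http://localhost:8000/
ssh -L 8000:localhost:8000 ubuntu@YOUR_PUBLIC_IP
```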
Training seems stuck
Check the log output:
- Look for `train/tok_per_sec` - it should be processing thousands of tokens per second
- If using wandb, monitor the `val_bpb` (validation loss) curve
- The pretraining step takes the longest (~2.5 hours) - this is normal
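You can turn the logged throughput into a rough ETA for the run (the numbers below are purely illustrative placeholders; substitute your own from the logs):

```python
def eta_hours(total_tokens: float, tok_per_sec: float) -> float:
    """Rough remaining-time estimate from logged throughput."""
    return total_tokens / tok_per_sec / 3600

# Example with placeholder numbers - plug in your own logged values
print(f"{eta_hours(10e9, 1e6):.1f} h")
```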