Overview
nanochat’s checkpoint system saves model parameters, optimizer state, and training metadata to enable resuming training, evaluation, and deployment. Checkpoints are organized by model size and training step.
Checkpoint Structure
Checkpoints are stored in base_checkpoints/<model_tag>/.
File Contents
model_*.pt: PyTorch state dict with all model parameters
optim_*.pt: optimizer state shards (one per rank)
meta_*.json: training metadata and configuration
Saving Checkpoints
During Training
Checkpoints are saved automatically during training based on the --save-every flag:
scripts/base_train.py:460-483.
Manual Saving
Use the save_checkpoint() function from Python:
nanochat/checkpoint_manager.py:42-59.
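As a rough sketch of what such a helper does, assuming the file-naming convention described on this page (the actual signature in nanochat/checkpoint_manager.py:42-59 may differ):

```python
# Illustrative sketch only -- not nanochat's exact API.
import json
import os

import torch


def save_checkpoint(checkpoint_dir, step, model_state, optimizer_state=None, meta=None):
    """Write model_/optim_/meta_ files for one training step."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    torch.save(model_state, os.path.join(checkpoint_dir, f"model_{step:06d}.pt"))
    if optimizer_state is not None:
        # In multi-GPU training there is one optimizer shard per rank;
        # this sketch writes a single file.
        torch.save(optimizer_state, os.path.join(checkpoint_dir, f"optim_{step:06d}.pt"))
    if meta is not None:
        with open(os.path.join(checkpoint_dir, f"meta_{step:06d}.json"), "w") as f:
            json.dump(meta, f, indent=2)
```

Keeping metadata in JSON rather than inside the .pt file means it can be inspected without loading tensors.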
Loading Checkpoints
For Resuming Training
Resume training from a specific step:
- Loads model parameters
- Loads optimizer state (per-rank shards)
- Restores dataloader position
- Restores training metrics (loss EMA, time, etc.)
scripts/base_train.py:154-158.
For Evaluation
Load a model for evaluation without optimizer state:
nanochat/checkpoint_manager.py:164-172.
Advanced: Direct Loading
For full control, use load_checkpoint() directly:
nanochat/checkpoint_manager.py:61-74.
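A hedged sketch of what a load_checkpoint-style helper might look like under the naming convention above (the real implementation at nanochat/checkpoint_manager.py:61-74 may differ):

```python
# Illustrative sketch only -- not nanochat's exact API.
import json
import os

import torch


def load_checkpoint(checkpoint_dir, step, device="cpu", load_optimizer=False):
    """Return (model_state, optimizer_state_or_None, meta) for one step."""
    tag = f"{step:06d}"
    model_state = torch.load(
        os.path.join(checkpoint_dir, f"model_{tag}.pt"), map_location=device
    )
    optimizer_state = None
    if load_optimizer:
        optimizer_state = torch.load(
            os.path.join(checkpoint_dir, f"optim_{tag}.pt"), map_location=device
        )
    with open(os.path.join(checkpoint_dir, f"meta_{tag}.json")) as f:
        meta = json.load(f)
    return model_state, optimizer_state, meta
```

For evaluation, leaving load_optimizer=False skips the largest files entirely.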
Checkpoint Naming
Model Tags
By default, checkpoints use d{depth} as the model tag.
Override with --model-tag:
scripts/base_train.py:151-152.
Step Numbers
Step numbers are zero-padded to 6 digits (e.g. model_000500.pt for step 500).
Checkpoint Utilities
Find Latest Checkpoint
nanochat/checkpoint_manager.py:138-144.
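A minimal version of "find the latest checkpoint" under the naming scheme above, as a sketch (the real helper is at nanochat/checkpoint_manager.py:138-144):

```python
# Sketch: scan a checkpoint directory for model_*.pt files and return the
# highest step number found, or None if the directory holds no checkpoints.
import re
from pathlib import Path


def find_latest_step(checkpoint_dir):
    steps = [
        int(m.group(1))
        for f in Path(checkpoint_dir).glob("model_*.pt")
        if (m := re.fullmatch(r"model_(\d+)\.pt", f.name))
    ]
    return max(steps) if steps else None
```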
Find Largest Model
nanochat/checkpoint_manager.py:118-135. Attempts to parse d{depth} format, falls back to most recently modified.
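The heuristic described above can be sketched as follows (a simplified stand-in for nanochat/checkpoint_manager.py:118-135, not the actual code):

```python
# Sketch: pick the deepest d{depth} model tag; if no directory name parses,
# fall back to the most recently modified directory.
import re
from pathlib import Path


def find_largest_model_tag(base_dir):
    dirs = [d for d in Path(base_dir).iterdir() if d.is_dir()]
    if not dirs:
        return None
    parsed = [
        (int(m.group(1)), d.name)
        for d in dirs
        if (m := re.fullmatch(r"d(\d+)", d.name))
    ]
    if parsed:
        return max(parsed)[1]  # deepest model wins
    return max(dirs, key=lambda d: d.stat().st_mtime).name
```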
Build Model from Checkpoint
nanochat/checkpoint_manager.py:77-115. Handles:
- Meta device initialization (no memory allocation)
- Weight loading with assign=True (zero-copy)
- BF16 → FP32 conversion for CPU/MPS
- Backward compatibility (patches missing config keys)
Backward Compatibility
nanochat patches missing parameters when loading old checkpoints:
Missing Config Keys
nanochat/checkpoint_manager.py:23-28.
Missing Model Parameters
nanochat/checkpoint_manager.py:30-40.
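The config-key patching amounts to filling in defaults for keys that newer code expects but old meta files lack; a minimal sketch (the key name n_kv_head below is illustrative, not necessarily one nanochat patches):

```python
# Sketch: merge defaults into an old config without overwriting existing keys.
def patch_config(config, defaults):
    patched = dict(config)  # leave the caller's dict untouched
    for key, value in defaults.items():
        patched.setdefault(key, value)
    return patched
```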
Disk Usage
Checkpoint sizes depend on model depth:
| Depth | Parameters | model_*.pt | optim_*.pt (per rank) | Total (8 ranks) |
|---|---|---|---|---|
| d12 | ~120M | ~480 MB | ~960 MB | ~8.2 GB |
| d20 | ~330M | ~1.3 GB | ~2.6 GB | ~22 GB |
| d26 | ~570M | ~2.3 GB | ~4.6 GB | ~39 GB |
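The table roughly follows from FP32 storage (4 bytes per parameter), with each rank's optimizer shard about twice the model file; a back-of-the-envelope sketch (the 2x optimizer multiplier is inferred from the table, not read from the code):

```python
# Rough disk estimate: one model file plus one optimizer shard per rank.
def estimate_total_gb(n_params, n_ranks=8, bytes_per_param=4, optim_multiplier=2):
    model_gb = n_params * bytes_per_param / 1e9
    return model_gb + n_ranks * model_gb * optim_multiplier
```

For d12 this gives 0.48 GB for the model plus 8 × 0.96 GB of optimizer shards, about 8.2 GB, matching the table.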
Best Practices
During Experimentation
During Long Runs
For Reproducibility
Always save:
- Model parameters (model_*.pt)
- Metadata (meta_*.json) with full config
- Training script and commit hash
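One way to capture this in a meta file, as a sketch (the git call and key names are illustrative, not nanochat's exact schema):

```python
# Sketch: write a meta file recording step, full config, and commit hash.
import json
import subprocess


def write_meta(path, step, config):
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # not in a git repo, or git not installed
    with open(path, "w") as f:
        json.dump({"step": step, "config": config, "git_commit": commit}, f, indent=2)
```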
Checkpoint Cleanup
Delete intermediate checkpoints to save space.
Checkpoint Migration
Move checkpoints between systems:
Copy to Another Machine
Resume on Different GPU Count
nanochat handles different GPU counts automatically.
Troubleshooting
“No checkpoints found”
Ensure the checkpoint directory exists.
“Checkpoint version mismatch”
Backward compatibility patches should handle most cases. If not:
“Out of memory when loading”
Use meta device for zero-memory loading:
nanochat/checkpoint_manager.py:100-105.
“Optimizer state shape mismatch”
This happens when the model architecture changes. Re-initialize the optimizer:
Further Reading
- nanochat/checkpoint_manager.py - Full checkpoint implementation
- scripts/base_train.py:460-483 - Checkpoint saving during training
- scripts/base_train.py:154-158 - Checkpoint loading for resume
- PyTorch Saving & Loading