Overview
Nanochat implements automatic compute-optimal scaling based on empirical scaling laws. The core insight: one parameter (depth) controls everything. As you increase `--depth`, the training script automatically adjusts:
- Model size (parameters)
- Training tokens (data)
- Batch size
- Learning rates
- Weight decay
The Depth Dial
Model size is controlled by a single parameter: `depth=12` → `base_dim=768` → `model_dim=768` → `num_heads=6`
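The depth dial can be sketched roughly as follows (a minimal sketch: the ×64 aspect ratio comes from the summary table, and `head_dim = 128` is inferred from the d12 numbers; the real implementation lives in the training script):

```python
def dims_from_depth(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> tuple[int, int]:
    """Derive model dimensions from the single depth dial.

    aspect_ratio and head_dim are illustrative constants chosen to
    reproduce the d12 reference point (768 dims, 6 heads).
    """
    model_dim = depth * aspect_ratio   # d12 -> 768
    num_heads = model_dim // head_dim  # 768 // 128 -> 6
    return model_dim, num_heads

print(dims_from_depth(12))  # (768, 6)
```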
Scaling Law 1: Optimal Training Tokens
The compute-optimal data:param ratio is determined empirically: `target_param_data_ratio = 10.5`
Why 10.5? Derived from scaling laws experiments (see runs/scaling_laws.sh). This differs from Chinchilla (20:1) because:
- Smaller models are compute-optimal with fewer tokens per parameter (more parameters per token)
- Different architecture (sliding windows, value embeddings)
- Empirically optimal for nanochat’s parameter count range
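The token-budget rule is simple enough to sketch directly (the function name is illustrative; nanochat's actual code may differ):

```python
TARGET_PARAM_DATA_RATIO = 10.5  # empirical, from the runs/scaling_laws.sh experiments

def optimal_training_tokens(scaling_params: int) -> int:
    """Compute-optimal number of training tokens for a given count of scaling parameters."""
    return int(TARGET_PARAM_DATA_RATIO * scaling_params)

print(optimal_training_tokens(100_000_000))  # 1_050_000_000 (~1B tokens for 100M params)
```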
Parameter Counting
Only certain parameters count toward the scaling ratio (see `dev/LOG.md`, Jan 27, 2026).
Scaling Law 2: Optimal Batch Size
Follows the Power Lines paper (arXiv:2505.13738): `B = B_ref × (D/D_ref)^0.383`, where:
- `B_ref = 524,288` tokens (optimal batch size at d12)
- `D_ref` = optimal training tokens for d12
- `D` = optimal training tokens for the current depth
- Doubling training tokens → 1.3× larger batch size
- 10× more tokens → 2.4× larger batch size
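The power law above can be sketched as (exponent 0.383 and `B_ref` come from the section text; names are illustrative):

```python
B_REF = 524_288  # tokens: optimal batch size at the d12 reference point

def optimal_batch_size(D: float, D_ref: float, alpha: float = 0.383) -> int:
    """B = B_ref * (D / D_ref) ** alpha  (Power Lines, arXiv:2505.13738)."""
    return int(B_REF * (D / D_ref) ** alpha)

# Doubling the token budget grows the batch by 2**0.383, about 1.30x
print(optimal_batch_size(2.0, 1.0) / B_REF)
```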
Why This Matters
Using too small a batch size:
- Wastes wall-clock time (more iterations needed)
- Hurts convergence (noisy gradients)
Using too large a batch size:
- Hurts generalization ("generalization gap")
- Wastes compute (diminishing returns)
Scaling Law 3: Learning Rate Scaling
When batch size changes, learning rates scale as:
- SGD: linear scaling (`lr ∝ B`)
- AdamW: square-root scaling (`lr ∝ √B`)
- Muon: assumed same as AdamW (not studied carefully)
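These rules can be sketched in a few lines (function name and reference values are illustrative):

```python
import math

def scale_lr(lr_ref: float, B: float, B_ref: float, optimizer: str = "adamw") -> float:
    """Rescale a reference learning rate when the batch size changes."""
    ratio = B / B_ref
    if optimizer == "sgd":
        return lr_ref * ratio            # linear scaling
    # AdamW -- and, by assumption, Muon: square-root scaling
    return lr_ref * math.sqrt(ratio)

print(scale_lr(0.02, 4.0, 1.0))  # 4x batch -> 2x lr under sqrt scaling: 0.04
```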
Scaling Law 4: Weight Decay Scaling
Follows the T_epoch framework (arXiv:2405.13698): `T_epoch = B / (η · λ · D)` is kept constant.
Intuition: As you train longer (larger D), you need less regularization (smaller λ).
Example:
- d12: `λ = 0.2`
- d20 (2.5× more tokens): `λ ≈ 0.08`
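A sketch of the constant-T_epoch rule, assuming the `√B` learning-rate scaling from the previous section (names are illustrative):

```python
import math

def scale_weight_decay(wd_ref: float, B: float, B_ref: float, D: float, D_ref: float) -> float:
    """Keep T_epoch = B / (lr * wd * D) constant.

    With lr scaling as sqrt(B), this gives wd = wd_ref * sqrt(B/B_ref) * (D_ref/D).
    """
    return wd_ref * math.sqrt(B / B_ref) * (D_ref / D)

# d12 baseline is unchanged; 2.5x more tokens shrinks weight decay sharply
print(scale_weight_decay(0.2, 2.5 ** 0.383, 1.0, 2.5, 1.0))
```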
Reference Model (d12)
All scaling is anchored to `depth=12`:
Running Scaling Laws Experiments
The `runs/scaling_laws.sh` script trains models at different depths and FLOP budgets:
- Train: `--target-flops=$flops --depth=$d`
- Evaluate: validation loss and CORE metric
- Log: results saved to CSV
Output
Results are saved to `$NANOCHAT_BASE_DIR/scaling_laws_results_${LABEL}/results.csv`:
Analysis
Plot validation loss vs. FLOPs for different depths.
Compute-Optimal Training
The goal: for a fixed compute budget (FLOPs), find the optimal (model_size, training_tokens) pair.
Fixed FLOPs Constraint
The same budget can be spent on either:
- Larger model + fewer tokens
- Smaller model + more tokens
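To make the trade-off concrete, here is a sketch using the standard C ≈ 6·N·D FLOPs approximation (an assumption for illustration; nanochat's own FLOP accounting may differ):

```python
def token_budget(flops: float, params: float) -> float:
    """Training tokens affordable at a fixed compute budget, via C ~= 6 * N * D."""
    return flops / (6 * params)

BUDGET = 1e18  # fixed FLOPs
for n in (50e6, 100e6, 200e6):
    # Larger models leave fewer tokens within the same budget
    print(f"{n / 1e6:.0f}M params -> {token_budget(BUDGET, n) / 1e9:.1f}B tokens")
```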
Chinchilla vs Nanochat
Chinchilla (DeepMind, 2022):
- Ratio: 20 tokens per parameter
- Example: 10B params → 200B tokens
Nanochat:
- Ratio: 10.5 tokens per parameter
- Example: 100M params → 1B tokens
Why the difference:
- Chinchilla studied 400M-70B param models
- Nanochat studies 30M-600M param models
- Smaller models benefit from more parameters per token (less data needed to amortize params)
Automatic Hyperparameter Scaling
When you run training with just `--depth` set, the script:
- Calculates model size: `params = f(depth, aspect_ratio, head_dim, ...)`
- Determines optimal tokens: `D = 10.5 × scaling_params`
- Computes optimal batch size: `B = B_ref × (D/D_ref)^0.383`
- Scales learning rates: `lr = lr_ref × √(B/B_ref)`
- Adjusts weight decay: `λ = λ_ref × √(B/B_ref) × (D_ref/D)`
- Calculates num_iterations: `num_iters = D / B`
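The whole pipeline condenses into one sketch. Everything here is illustrative: the parameter count uses the rough 12·L·d² transformer estimate rather than nanochat's real `f(...)`, and `lr_ref`/`wd_ref` are placeholder reference values, not nanochat's actual defaults.

```python
import math

B_REF, RATIO, ALPHA = 524_288, 10.5, 0.383  # d12 batch size, data:param ratio, batch exponent

def plan_training(depth: int, lr_ref: float = 0.02, wd_ref: float = 0.2) -> dict:
    def stats(d: int) -> tuple[int, float]:
        dim = d * 64                  # model_dim = depth * aspect_ratio
        params = 12 * d * dim * dim   # rough transformer-matrix parameter estimate
        return params, RATIO * params # (scaling params, optimal tokens D)

    _, D_ref = stats(12)              # everything is anchored to d12
    params, D = stats(depth)
    B = B_REF * (D / D_ref) ** ALPHA
    lr = lr_ref * math.sqrt(B / B_REF)
    wd = wd_ref * math.sqrt(B / B_REF) * (D_ref / D)
    return {"params": params, "tokens": int(D), "batch_size": int(B),
            "lr": lr, "weight_decay": wd, "num_iterations": int(D // B)}

print(plan_training(12))  # at d12 all ratios are 1, so the reference values pass through
```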
Logged Output Example
Validation: Bits Per Byte
Instead of cross-entropy loss (bits per token), nanochat reports bits per byte: the per-token loss in bits divided by the average number of bytes per token, which makes results comparable across tokenizers.
Override Parameters
You can override automatic scaling with explicit flags:
- Override training length
- Override batch size
- Override learning rates
Muon Momentum Schedule
Muon optimizer uses a momentum warmup (independent of depth).
Summary
Scaling laws in nanochat:
| What | Scales As | Reference |
|---|---|---|
| Model dim | depth × 64 | Linear |
| Training tokens | 10.5 × scaling_params | Empirical |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | lr_ref × √(B/B_ref) | AdamW theory |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch |
| Num iterations | tokens / batch_size | Derived |
Further Reading
- Chinchilla paper - Original compute-optimal scaling laws
- Power Lines paper - Optimal batch size scaling
- T_epoch paper - Weight decay scaling
- muP paper - Hyperparameter transfer across model sizes
- `dev/LOG.md` in nanochat source - Detailed scaling laws experiments and findings