Overview
The nanochat data loader implements a BOS-aligned best-fit algorithm for packing tokenized documents into training sequences. This approach:
- Ensures every sequence starts with a BOS (beginning-of-sequence) token
- Uses best-fit packing to minimize wasted tokens
- Achieves 100% utilization (no padding)
- Handles distributed training with DDP sharding
- Supports resumption from checkpoints
Design Trade-offs
BOS-Aligned Best-Fit
Advantages:
- Every token can attend back to a BOS token
- Full document context is preserved for most tokens
- Cleaner training signal (less confusion from concatenated documents)

Disadvantages:
- ~35% of tokens are cropped to maintain alignment
- Wastes more tokens than simple concatenation
Alternative: Simple Concatenation
For limited data or very long documents, consider the original tokenizing_distributed_data_loader that concatenates documents without BOS alignment: https://github.com/karpathy/nanochat/blob/3c3a3d7/nanochat/dataloader.py#L78-L117 This approach wastes fewer tokens but produces more “confusing” examples where context switches abruptly.

Algorithm
Best-Fit Packing
For each sequence of length T+1 (input + target):
- Find best fit: From a buffer of documents, select the largest document that fits entirely in remaining space
- Repeat: Continue adding documents until no document fits
- Fill remaining: When nothing fits, crop a document (shortest in buffer) to fill remaining space exactly
Pseudocode
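The three steps above can be sketched as a small Python function. This is an illustrative reconstruction, not nanochat's actual code; the function name and buffer handling are assumptions (the real loader also refills the buffer as it drains — see the Document Buffer section).

```python
def pack_sequence(buffer, seq_len):
    """Best-fit pack tokenized documents (each starting with BOS) into
    exactly `seq_len` tokens. `buffer` is a list of token lists and is
    mutated in place. Illustrative sketch, not nanochat's real code."""
    seq = []
    remaining = seq_len
    while remaining > 0:
        # Step 1-2: take the largest document that fits entirely.
        fitting = [d for d in buffer if len(d) <= remaining]
        if fitting:
            doc = max(fitting, key=len)
            buffer.remove(doc)
            seq.extend(doc)
            remaining -= len(doc)
        else:
            # Step 3: nothing fits — crop the shortest document in the
            # buffer to fill the remaining space exactly.
            doc = min(buffer, key=len)
            buffer.remove(doc)
            seq.extend(doc[:remaining])
            remaining = 0
    return seq
```

Because the largest fitting document is always taken first, short documents tend to survive in the buffer and end up plugging the gaps at the end of a sequence, which keeps the cropped fraction down.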
Implementation Details
Function Signature
DDP Sharding
Each rank processes a disjoint subset of the data:
- No data duplication across ranks
- Balanced load (assuming row groups are similar size)
- Simple implementation (no explicit coordination)
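One simple way to get this behavior is to stride row groups across ranks round-robin, so no coordination is needed. A sketch of the idea (the real loader's sharding granularity may differ):

```python
def shard_row_groups(num_row_groups_per_file, rank, world_size):
    """Assign parquet row groups to ranks round-robin, so every rank reads
    a disjoint subset with no explicit coordination. Illustrative sketch."""
    assignments = []
    g = 0  # global row-group counter across all files
    for pq_idx, n in enumerate(num_row_groups_per_file):
        for rg_idx in range(n):
            if g % world_size == rank:
                assignments.append((pq_idx, rg_idx))
            g += 1
    return assignments
```

As long as row groups are of similar size, this striding keeps the per-rank load balanced.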
Resumption
The loader tracks its position in the dataset and returns it with each batch:
- pq_idx: current parquet file index
- rg_idx: current row group index within the file
- epoch: number of complete passes through the dataset
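One way this position state can advance and wrap, sketched in plain Python (the real loader's bookkeeping may differ); resuming from a checkpoint amounts to restarting the walk from a saved (pq_idx, rg_idx, epoch) triple:

```python
def iter_positions(row_groups_per_file, start_pq=0, start_rg=0, start_epoch=0):
    """Walk (pq_idx, rg_idx, epoch) positions forever, resuming from a
    saved state and incrementing epoch on each full pass. Sketch only."""
    pq, rg, epoch = start_pq, start_rg, start_epoch
    while True:
        yield pq, rg, epoch
        rg += 1
        if rg >= row_groups_per_file[pq]:   # exhausted this file
            rg = 0
            pq += 1
            if pq >= len(row_groups_per_file):  # exhausted the dataset
                pq = 0
                epoch += 1
```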
Multi-Epoch Support
The loader automatically cycles through the dataset infinitely.

Memory Optimization
Pre-allocated Buffers
The loader uses persistent buffers to avoid repeated allocations:
- Zero-copy views into contiguous memory
- Single HtoD transfer per batch
- Pinned memory for async transfer
Transfer Pipeline
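The zero-copy-view part of the pipeline can be demonstrated with stdlib types; this is a stand-in for what would actually be a pinned `torch.Tensor` (allocated once with `pin_memory()` and copied with `.to(device, non_blocking=True)` for the single async HtoD transfer per batch):

```python
from array import array

def make_row_views(B, T):
    """Pre-allocate one contiguous buffer of B * (T + 1) int64 tokens and
    hand out per-row views into it, so filling rows never copies or
    reallocates. Stdlib stand-in for the pinned tensor the loader uses."""
    flat = array("q", bytes(8 * B * (T + 1)))  # one contiguous allocation
    mv = memoryview(flat)
    rows = [mv[i * (T + 1):(i + 1) * (T + 1)] for i in range(B)]
    return flat, rows
```

Writing through a row view mutates the flat buffer directly, which is what makes the single contiguous transfer possible.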
Document Buffer
The best-fit algorithm maintains a buffer of tokenized documents:
- Size: configurable (default 1000 documents)
- Purpose: Provide choices for best-fit selection
- Refill: Automatically refills when buffer runs low
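The refill behavior can be sketched as follows; the function name is illustrative, and the 1000 matches the default buffer size described above:

```python
def refill(buffer, doc_stream, target_size=1000):
    """Top the document buffer back up from a stream of tokenized
    documents. Illustrative sketch of the refill step."""
    while len(buffer) < target_size:
        doc = next(doc_stream, None)
        if doc is None:
            break  # stream exhausted (the real loader wraps to the next epoch)
        buffer.append(doc)
    return buffer
```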
Tokenization
Documents are tokenized in parallel:
- Batch size: 128 documents (default)
- Threads: 4 (default)
- BOS token prepended to every document
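A minimal sketch of batched parallel tokenization with a BOS token prepended to every document; `tokenize_batch` and the BOS id are illustrative, not nanochat's API:

```python
from concurrent.futures import ThreadPoolExecutor

BOS = 0  # placeholder id; the real tokenizer defines its own BOS token

def tokenize_batch(texts, encode, threads=4):
    """Tokenize a batch of documents across `threads` workers and prepend
    BOS to each. `encode` is any str -> list[int] function. Sketch only."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return [[BOS] + toks for toks in pool.map(encode, texts)]
```

Note that threads only give real parallelism when the tokenizer releases the GIL (as compiled tokenizers typically do); with a pure-Python `encode` the pool mainly overlaps I/O.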
Data Format
The loader expects parquet files with a 'text' column. Files are discovered via list_parquet_files(), which looks for *.parquet in the dataset directory.
Reference: dataloader.py:35-36, dataloader.py:63-64
Split Logic
Train/val split is determined by parquet file:
- Validation data is small enough to fit in a single parquet file
- Validation file is placed last in directory listing
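With the validation file last in the directory listing, the split reduces to slicing the file list; a sketch of that logic (the helper name is illustrative):

```python
def split_files(parquet_files, split):
    """Last parquet file is validation; everything before it is training.
    Assumes a sorted listing with the val file placed last. Sketch only."""
    if split == "val":
        return parquet_files[-1:]
    return parquet_files[:-1]
```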
Usage Example
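A hedged sketch of how such a loader is typically consumed in a training loop. `fake_loader` is a stand-in with the same (inputs, targets, state) contract described above, not nanochat's actual API:

```python
# Stand-in loader with the (inputs, targets, state) contract described above.
def fake_loader(B, T, split="train"):
    """Yields (x, y, state) batches forever; state is a position snapshot."""
    state = {"pq_idx": 0, "rg_idx": 0, "epoch": 0}
    while True:
        x = [[0] * T for _ in range(B)]  # inputs:  tokens [0, T)
        y = [[0] * T for _ in range(B)]  # targets: tokens [1, T+1)
        yield x, y, dict(state)          # snapshot, not a live reference
        state["rg_idx"] += 1

loader = fake_loader(B=2, T=8)
for step in range(3):
    x, y, state = next(loader)
    # train_step(x, y) ...
    last_state = state  # save alongside the checkpoint to resume later
```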
Validation Loader
For validation, use the same loader with split="val".
Simplified Interface
A simplified interface is available for cases where you don’t need state tracking.

Performance Characteristics
| Aspect | Value |
|---|---|
| Utilization | 100% (no padding) |
| Token waste | ~35% (cropping) |
| Buffer memory | ~1000 docs × avg_doc_len × 4 bytes |
| HtoD transfers | 1 per batch |
| DDP efficiency | Near-linear scaling |
Related
GPT Architecture
Model architecture overview
Optimizer
MuonAdamW optimizer details