
Overview

The nanochat data loader implements a BOS-aligned best-fit algorithm for packing tokenized documents into training sequences. This approach:
  • Ensures every sequence starts with a BOS (beginning-of-sequence) token
  • Uses best-fit packing to minimize wasted tokens
  • Achieves 100% utilization (no padding)
  • Handles distributed training with DDP sharding
  • Supports resumption from checkpoints
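As a toy illustration (token ids invented for this sketch), a packed row is a run of whole BOS-prefixed documents plus one cropped document, landing exactly on the row length with no padding:

```python
# Toy illustration (invented token ids): every document starts with BOS,
# and a row is filled by whole documents plus one cropped document.
BOS = 1
doc_a = [BOS, 10, 11, 12]      # fits whole
doc_b = [BOS, 20, 21]          # fits whole
doc_c = [BOS, 30, 31, 32, 33]  # cropped to fill the row exactly
row_len = 10                   # T + 1 tokens per row
row = doc_a + doc_b + doc_c[: row_len - len(doc_a) - len(doc_b)]
assert len(row) == row_len and row[0] == BOS  # no padding anywhere
```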

Design Trade-offs

BOS-Aligned Best-Fit

Advantages:
  • Every token can attend back to a BOS token
  • Full document context is preserved for most tokens
  • Cleaner training signal (less confusion from concatenated documents)
Cost:
  • ~35% of tokens are cropped to maintain alignment
  • More aggressive than simple concatenation
Reference: dataloader.py:4-16

Alternative: Simple Concatenation

For limited data or very long documents, consider the original tokenizing_distributed_data_loader, which concatenates documents without BOS alignment: https://github.com/karpathy/nanochat/blob/3c3a3d7/nanochat/dataloader.py#L78-L117. That approach wastes fewer tokens but produces more “confusing” examples in which context switches abruptly mid-sequence.

Algorithm

Best-Fit Packing

For each sequence of length T+1 (input + target):
  1. Find best fit: From a buffer of documents, select the largest document that fits entirely in remaining space
  2. Repeat: Continue adding documents until no document fits
  3. Fill remaining: When nothing fits, crop a document (shortest in buffer) to fill remaining space exactly
This is a greedy approximation to the bin-packing problem, optimized for simplicity and speed. Reference: dataloader.py:85-94

Pseudocode

for each row in batch:
    pos = 0
    while pos < sequence_length:
        remaining = sequence_length - pos
        # Find the largest doc that fits in the remaining space
        candidates = [doc for doc in buffer if len(doc) <= remaining]
        if candidates:
            best_doc = max(candidates, key=len)
            row[pos:pos+len(best_doc)] = best_doc
            pos += len(best_doc)
        else:
            # Nothing fits: crop the shortest doc to fill exactly
            shortest_doc = min(buffer, key=len)
            row[pos:] = shortest_doc[:remaining]
            pos = sequence_length
Reference: dataloader.py:122-150
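The pseudocode can be turned into a runnable sketch. `pack_row` below is a hypothetical helper, not the actual implementation: the real dataloader.py packs into a preallocated tensor row and refills the buffer as it drains.

```python
def pack_row(buffer, seq_len):
    """Fill one training row of exactly seq_len tokens from `buffer`,
    a list of tokenized documents (each starting with BOS).
    Assumes the buffer is non-empty; the real loader refills it first."""
    row = []
    while len(row) < seq_len:
        remaining = seq_len - len(row)
        # Largest document that fits entirely in the remaining space.
        fitting = [d for d in buffer if len(d) <= remaining]
        if fitting:
            doc = max(fitting, key=len)
            buffer.remove(doc)
            row.extend(doc)
        else:
            # Nothing fits: crop the shortest document to fill exactly.
            doc = min(buffer, key=len)
            buffer.remove(doc)
            row.extend(doc[:remaining])
    return row
```

Selecting the largest fitting document first leaves small documents in the buffer to plug the final gap, which is what keeps the cropped fraction down.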

Implementation Details

Function Signature

def tokenizing_distributed_data_loader_with_state_bos_bestfit(
    tokenizer,           # Tokenizer instance
    B,                   # Batch size
    T,                   # Sequence length
    split,               # "train" or "val"
    tokenizer_threads=4, # Parallel tokenization threads
    tokenizer_batch_size=128,
    device="cuda",
    resume_state_dict=None,  # For resuming from checkpoint
    buffer_size=1000,    # Document buffer size for best-fit
):
Reference: dataloader.py:73-78

DDP Sharding

Each rank processes a disjoint subset of the data:
# Each rank reads different row groups
rg_idx = ddp_rank  # Start offset
while rg_idx < num_row_groups:
    process(rg_idx)
    rg_idx += ddp_world_size  # Stride by world size
This ensures:
  • No data duplication across ranks
  • Balanced load (assuming row groups are similar size)
  • Simple implementation (no explicit coordination)
Reference: dataloader.py:61-67
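The stride pattern can be written as a standalone helper (hypothetical name) to check that ranks cover the row groups disjointly:

```python
def row_groups_for_rank(rank, world_size, num_row_groups):
    """Row groups claimed by one rank: rank, rank+W, rank+2W, ..."""
    return list(range(rank, num_row_groups, world_size))

# Every row group is claimed by exactly one rank:
shards = [row_groups_for_rank(r, 4, 10) for r in range(4)]
claimed = sorted(i for shard in shards for i in shard)
assert claimed == list(range(10))  # full coverage, no duplicates
```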

Resumption

The loader tracks position in the dataset and returns it with each batch:
for inputs, targets, state_dict in loader:
    # state_dict = {"pq_idx": ..., "rg_idx": ..., "epoch": ...}
    train_step(inputs, targets)
    if checkpoint:
        save(state_dict)
When resuming:
state = load_checkpoint()["dataloader_state"]
loader = dataloader(..., resume_state_dict=state)
  • pq_idx: Current parquet file index
  • rg_idx: Current row group index within file
  • epoch: Number of complete passes through dataset
The loader advances by 1 row group on resume to avoid repeating data. Reference: dataloader.py:39-59, dataloader.py:156
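A simplified, single-rank sketch of the advance-by-one-row-group rule (hypothetical helper; the real loader also strides row groups by world size):

```python
def next_position(state, num_row_groups, num_files):
    """Advance (pq_idx, rg_idx) by one row group on resume,
    rolling over files and epochs as needed."""
    pq_idx, rg_idx, epoch = state["pq_idx"], state["rg_idx"], state["epoch"]
    rg_idx += 1
    if rg_idx >= num_row_groups:      # past the last row group in this file
        rg_idx = 0
        pq_idx += 1
        if pq_idx >= num_files:       # past the last file: new epoch
            pq_idx = 0
            epoch += 1
    return {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
```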

Multi-Epoch Support

The loader automatically cycles through the dataset infinitely:
while True:  # Multi-epoch loop
    for pq_file in parquet_files:
        for row_group in pq_file:
            yield batch
    epoch += 1  # Track epoch count
Reference: dataloader.py:46-70

Memory Optimization

Pre-allocated Buffers

The loader uses persistent buffers to avoid repeated allocations:
# Allocate once at initialization
row_buffer = torch.empty((B, T+1), dtype=torch.long)
cpu_buffer = torch.empty(2*B*T, dtype=torch.long, pin_memory=True)
gpu_buffer = torch.empty(2*B*T, dtype=torch.long, device="cuda")

# Views into buffers
cpu_inputs = cpu_buffer[:B*T].view(B, T)
cpu_targets = cpu_buffer[B*T:].view(B, T)
This enables:
  • Zero-copy views into contiguous memory
  • Single HtoD transfer per batch
  • Pinned memory for async transfer
Reference: dataloader.py:110-119

Transfer Pipeline

# 1. Build batch in row_buffer (CPU)
for row in range(B):
    pack_documents_into_row(row_buffer[row])

# 2. Copy to pinned CPU buffer (inputs and targets)
cpu_inputs.copy_(row_buffer[:, :-1])
cpu_targets.copy_(row_buffer[:, 1:])

# 3. Single async HtoD transfer
gpu_buffer.copy_(cpu_buffer, non_blocking=True)

# 4. Yield views into GPU buffer
yield inputs, targets  # No copy, just views
Reference: dataloader.py:152-160

Document Buffer

The best-fit algorithm maintains a buffer of tokenized documents:
  • Size: Configurable (default 1000 documents)
  • Purpose: Provide choices for best-fit selection
  • Refill: Automatically refills when buffer runs low
doc_buffer = []  # List of token lists

def refill_buffer():
    doc_batch = next(parquet_iterator)
    token_lists = tokenizer.encode(doc_batch, prepend=bos_token)
    doc_buffer.extend(token_lists)
Trade-off: Larger buffer → better packing, but more memory usage and startup latency. Reference: dataloader.py:100-108, dataloader.py:125-127
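The refill rule can be sketched as follows (`refill` and `doc_stream` are hypothetical names; the real loader refills from the tokenized parquet stream):

```python
def refill(doc_buffer, doc_stream, buffer_size=1000):
    # Top the buffer back up until it holds buffer_size docs or the
    # stream is exhausted (the real loader then wraps to a new epoch).
    while len(doc_buffer) < buffer_size:
        try:
            doc_buffer.append(next(doc_stream))
        except StopIteration:
            break
    return doc_buffer
```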

Tokenization

Documents are tokenized in parallel:
token_lists = tokenizer.encode(
    doc_batch,
    prepend=bos_token,
    num_threads=4
)
  • Batch size: 128 documents (default)
  • Threads: 4 (default)
  • BOS token prepended to every document
Reference: dataloader.py:106

Data Format

The loader expects parquet files with a 'text' column:
rg = parquet_file.read_row_group(rg_idx)
batch = rg.column('text').to_pylist()  # List of strings
Files are discovered via list_parquet_files() which looks for *.parquet in the dataset directory. Reference: dataloader.py:35-36, dataloader.py:63-64

Split Logic

Train/val split is determined by parquet file:
parquet_paths = list_parquet_files()
if split == "train":
    parquet_paths = parquet_paths[:-1]  # All but last file
else:  # "val"
    parquet_paths = parquet_paths[-1:]  # Last file only
This assumes:
  • Validation data is small enough to fit in a single parquet file
  • Validation file is placed last in directory listing
Reference: dataloader.py:37
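The same rule as a standalone helper (hypothetical name, for illustration):

```python
def split_paths(parquet_paths, split):
    """Last parquet file is validation; everything else is training."""
    if split == "train":
        return parquet_paths[:-1]  # all but the last file
    return parquet_paths[-1:]      # "val": last file only
```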

Usage Example

from nanochat.dataloader import tokenizing_distributed_data_loader_with_state_bos_bestfit
from nanochat.tokenizer import Tokenizer

tokenizer = Tokenizer("path/to/tokenizer.model")

# Training
train_loader = tokenizing_distributed_data_loader_with_state_bos_bestfit(
    tokenizer=tokenizer,
    B=16,
    T=2048,
    split="train",
    device="cuda",
)

for step, (inputs, targets, state) in enumerate(train_loader):
    loss = model(inputs, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    if step % 1000 == 0:
        checkpoint = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "dataloader": state,
        }
        torch.save(checkpoint, f"checkpoint_{step}.pt")

Validation Loader

For validation, use the same loader with split="val":
val_loader = tokenizing_distributed_data_loader_with_state_bos_bestfit(
    tokenizer=tokenizer,
    B=16,
    T=2048,
    split="val",
    device="cuda",
)

# Validation loop (no state saving needed)
import itertools
for inputs, targets, _ in itertools.islice(val_loader, 100):
    with torch.no_grad():
        loss = model(inputs, targets)
        val_losses.append(loss.item())

Simplified Interface

For cases where you don’t need state tracking:
from nanochat.dataloader import tokenizing_distributed_data_loader_bos_bestfit

loader = tokenizing_distributed_data_loader_bos_bestfit(
    tokenizer=tokenizer,
    B=16,
    T=2048,
    split="train",
    device="cuda",
)

for inputs, targets in loader:
    # No state_dict in output
    train_step(inputs, targets)
Reference: dataloader.py:162-165

Performance Characteristics

  • Utilization: 100% (no padding)
  • Token waste: ~35% (cropping)
  • Buffer memory: ~1000 docs × avg_doc_len × 4 bytes
  • HtoD transfers: 1 per batch
  • DDP efficiency: near-linear scaling
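A back-of-envelope check on the buffer-memory figure, assuming (for illustration only) an average document length of 800 tokens and 4 bytes per token id:

```python
buffer_docs = 1000       # default buffer_size
avg_doc_len = 800        # assumed average document length (tokens)
bytes_per_token = 4      # e.g. int32 token ids
buffer_bytes = buffer_docs * avg_doc_len * bytes_per_token
print(buffer_bytes / 1e6)  # → 3.2 (MB): modest next to model memory
```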
