Overview
Bits per byte (BPB) is a tokenization-independent metric for evaluating language model loss. Unlike standard cross-entropy loss, BPB normalizes by the number of UTF-8 bytes in the target text, making it comparable across different tokenizers and vocabulary sizes.
Why Bits Per Byte?
Standard cross-entropy loss has a critical limitation:
- Different vocab sizes produce different losses - a 50K-vocab tokenizer yields a different average loss than a 100K-vocab tokenizer, even for the same underlying model quality
- No apples-to-apples comparisons - changing the tokenizer breaks loss comparisons
BPB solves this by normalizing loss by the actual information content (bytes) rather than token count.
How BPB is Calculated
bpb = total_nats / (math.log(2) * total_bytes)
Where:
total_nats = Sum of cross-entropy losses across all tokens
total_bytes = Sum of UTF-8 byte lengths for all tokens
math.log(2) converts from nats (natural log) to bits (log base 2)
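As a quick sanity check, the formula can be evaluated by hand. The totals below are toy numbers chosen for illustration, not real measurements:

```python
import math

# Toy totals: suppose evaluation accumulated these sums.
total_nats = 2000.0   # sum of per-token cross-entropy losses (in nats)
total_bytes = 1200    # sum of UTF-8 byte lengths of the target tokens

# Divide by ln(2) to convert nats to bits, then normalize per byte.
bpb = total_nats / (math.log(2) * total_bytes)
print(round(bpb, 4))  # ~2.4045 bits per byte
```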
Step-by-Step Process
1. Compute per-token losses
   loss2d = model(x, y, loss_reduction='none')  # (B, T)
2. Get byte lengths for each token
   token_bytes = get_token_bytes(device=device)  # (vocab_size,)
   num_bytes = token_bytes[y]  # look up byte length of each target token
3. Accumulate weighted losses
   total_nats += (loss2d * (num_bytes > 0)).sum()
   total_bytes += num_bytes.sum()
4. Convert to bits per byte
   bpb = total_nats / (math.log(2) * total_bytes)
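The four steps above can be traced in plain Python. This is a minimal sketch over hand-picked per-token losses and byte lengths, standing in for the tensors in the real pipeline:

```python
import math

# Toy batch: per-token loss (nats) and UTF-8 byte length of each
# target token. A byte length of 0 marks a special token.
losses    = [2.1, 1.8, 2.5, 0.9, 3.0]
num_bytes = [3,   4,   0,   2,   5]   # third token is special -> excluded

total_nats = 0.0
total_bytes = 0
for loss, nbytes in zip(losses, num_bytes):
    if nbytes > 0:           # step 3: special tokens contribute no loss...
        total_nats += loss
    total_bytes += nbytes    # ...and 0 bytes, so they drop out entirely

bpb = total_nats / (math.log(2) * total_bytes)  # step 4
print(round(bpb, 4))
```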
Implementation
From nanochat/loss_eval.py:evaluate_bpb:
import math
import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
    total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
    batch_iter = iter(batches)
    for _ in range(steps):
        x, y = next(batch_iter)
        loss2d = model(x, y, loss_reduction='none')  # (B, T)
        loss2d = loss2d.view(-1)
        y = y.view(-1)
        # Get byte lengths for each target token
        num_bytes2d = token_bytes[y]
        # Accumulate (only count tokens with num_bytes > 0)
        total_nats += (loss2d * (num_bytes2d > 0)).sum()
        total_bytes += num_bytes2d.sum()
    # Aggregate across GPUs if distributed
    if dist.is_initialized():
        dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
    # Convert to BPB
    bpb = total_nats.item() / (math.log(2) * total_bytes.item())
    return bpb
Token Bytes Tensor
The token_bytes tensor maps each token ID to its UTF-8 byte length:
from nanochat.tokenizer import get_token_bytes
token_bytes = get_token_bytes(device=device) # Shape: (vocab_size,)
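Conceptually, the mapping is the UTF-8 byte length of each token's string, with special tokens zeroed out. A minimal sketch over a hypothetical five-token vocab (the vocab and special-token set here are illustrative, not nanochat's actual vocabulary):

```python
# Hypothetical tiny vocab illustrating the token_bytes mapping.
vocab = {0: "<|bos|>", 1: "the", 2: " world", 3: "é", 4: "<|eos|>"}
special = {"<|bos|>", "<|eos|>"}

token_bytes = [
    0 if tok in special else len(tok.encode("utf-8"))
    for tok in (vocab[i] for i in sorted(vocab))
]
print(token_bytes)  # note "é" is 2 bytes in UTF-8, specials are 0
```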
Special handling:
- Special tokens (e.g., <|bos|>, <|eos|>) have token_bytes[id] = 0
- These are excluded from the metric (contributing neither nats nor bytes)
- Target positions with ignore_index = -1 are also excluded
Handling Special Cases
Ignored Tokens
Some tokens may be masked out (e.g., ignore_index = -1):
if (y.int() < 0).any():
    # Don't index with negative values
    valid = y >= 0
    y_safe = torch.where(valid, y, torch.zeros_like(y))
    num_bytes = torch.where(
        valid,
        token_bytes[y_safe],
        torch.zeros_like(y, dtype=token_bytes.dtype),
    )
    total_nats += (loss2d * (num_bytes > 0)).sum()
    total_bytes += num_bytes.sum()
Special Tokens
Tokens like <|bos|> automatically contribute 0 bytes and are excluded.
Running BPB Evaluation
Full Evaluation
torchrun --nproc_per_node=8 -m scripts.base_eval \
--model-tag d24 \
--eval bpb \
--device-batch-size 32 \
--split-tokens 20971520
Quick Evaluation
python -m scripts.base_eval \
--model-tag d24 \
--eval bpb \
--device-batch-size 16 \
--split-tokens 524288
Parameters
--split-tokens: Total tokens to evaluate per split (train/val). Must be divisible by device_batch_size * sequence_len * num_gpus.
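The divisibility constraint can be checked up front. This sketch uses the full-evaluation command's settings; the sequence length of 2048 is an assumption, not a value stated in this document:

```python
# Settings from the full-evaluation command above.
split_tokens      = 20_971_520   # --split-tokens
device_batch_size = 32           # --device-batch-size
num_gpus          = 8            # --nproc_per_node
sequence_len      = 2048         # assumed model context length

tokens_per_step = device_batch_size * sequence_len * num_gpus
assert split_tokens % tokens_per_step == 0, "split-tokens must divide evenly"
steps = split_tokens // tokens_per_step
print(steps)  # number of evaluation steps per split
```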
Interpreting Results
Training vs Validation BPB
train bpb: 1.234567
val bpb: 1.456789
- Lower is better - Lower BPB means better compression/prediction
- Val > Train - Expected, indicates some overfitting
- Val >> Train - Significant overfitting; the model is not generalizing well
- Val ≈ Train - Good generalization
Comparing Models
Because BPB is tokenization-independent:
model_a_bpb = 1.23 # 50K vocab
model_b_bpb = 1.25 # 100K vocab
# Model A is better, despite different vocab sizes!
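BPB also has a direct compression reading: a model at a given BPB effectively compresses text to bpb/8 of its raw size, since raw bytes are 8 bits each. A small illustration (the helper function is ours, not part of nanochat):

```python
# A model at a given BPB compresses text by a factor of 8 / bpb.
def compression_ratio(bpb: float) -> float:
    return 8.0 / bpb   # raw bytes are 8 bits each

print(round(compression_ratio(1.23), 2))  # model A: ~6.5x
print(round(compression_ratio(1.25), 2))  # model B: ~6.4x
```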
Distributed Evaluation
BPB evaluation automatically distributes across GPUs:
# Each GPU processes independent batches
tokens_per_step = batch_size * seq_len * world_size
steps = split_tokens // tokens_per_step
# Results are aggregated with all_reduce
dist.all_reduce(total_nats, op=dist.ReduceOp.SUM)
dist.all_reduce(total_bytes, op=dist.ReduceOp.SUM)
Example Output
================================================================================
BPB Evaluation
================================================================================
train bpb: 1.234567
val bpb: 1.456789
Results are also logged to the report:
from nanochat.report import get_report
get_report().log(section="Base model evaluation", data=[
    {"model": "base_model (step 10000)",
     "train bpb": 1.234567,
     "val bpb": 1.456789}
])
Reference
Implementation: nanochat/loss_eval.py:evaluate_bpb