## Overview
The `nanochat.loss_eval` module provides the `evaluate_bpb` function for evaluating language model performance using the bits per byte (bpb) metric.
## evaluate_bpb
Evaluate a model using the bits per byte metric.

### Parameters
- The language model to evaluate.
- An iterator that yields `(x, y)` tuples, where `x` is the input tensor of shape `(B, T)` and `y` is the target tensor of shape `(B, T)`.
- The number of evaluation steps (batches to process).
- `token_bytes`: a 1D tensor of shape `(vocab_size,)` giving the number of bytes each token ID decodes to. Set an entry to 0 for special tokens that should not be counted in the metric.
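As an illustration of the `token_bytes` argument, here is a sketch that builds such a tensor from a list of per-token byte strings; the tiny vocabulary and the special-token ID are made up for the example:

```python
import torch

# Hypothetical decoded byte strings for a tiny 5-token vocabulary;
# token 4 is a special token (e.g. an end-of-text marker) and gets 0 bytes.
decoded = [b'a', b'the', b' hello', b'\xe2\x82\xac', b'<|endoftext|>']
special_ids = {4}

token_bytes = torch.tensor(
    [0 if i in special_ids else len(b) for i, b in enumerate(decoded)],
    dtype=torch.long,
)
print(token_bytes.tolist())  # [1, 3, 6, 3, 0]
```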
### Returns
The bits per byte metric as a float. Returns `float('inf')` if the total byte count is 0.

### Description
The bits per byte (bpb) metric is independent of the tokenizer's vocabulary size. Unlike mean loss, bpb normalizes by the number of bytes that the target tokens represent, making it comparable across different tokenizers and vocabulary sizes.

Key features:

- Normal tokens: weighted by their length in bytes
- Special tokens: excluded from the metric (set `token_bytes[id] = 0`)
- Masked tokens: tokens with negative IDs (e.g., an ignore_index of -1) are excluded
- Distributed support: results are automatically aggregated across all ranks
### Formula
- `total_nats`: sum of per-token cross-entropy losses, each weighted by the byte length of its target token
- `total_bytes`: sum of byte lengths over all target tokens

The metric converts nats to bits and normalizes per byte:

bpb = total_nats / (ln(2) * total_bytes)
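As a worked instance (the numbers are made up), a model that accrues 2000 nats of byte-weighted loss over targets totaling 500 bytes scores:

```python
import math

total_nats = 2000.0   # sum of byte-weighted cross-entropy losses (nats)
total_bytes = 500.0   # total bytes represented by the target tokens

bpb = total_nats / (math.log(2) * total_bytes)
print(round(bpb, 4))  # about 5.77 bits per byte
```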
### Example
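The following is a minimal, self-contained sketch of the intended usage. The `evaluate_bpb` defined here is a simplified, single-process re-implementation for illustration only (the real function lives in `nanochat.loss_eval` and additionally handles distributed aggregation), and `ToyModel` is a made-up stand-in that returns the uniform cross-entropy at every position:

```python
import math
import torch

def evaluate_bpb(model, batches, steps, token_bytes):
    """Simplified single-process sketch of the bpb evaluation loop."""
    total_nats, total_bytes = 0.0, 0
    for _, (x, y) in zip(range(steps), batches):
        loss = model(x, y, loss_reduction='none')        # per-token losses, shape (B, T)
        valid = y >= 0                                   # mask out ignore_index targets
        ids = torch.where(valid, y, torch.zeros_like(y)) # safe indices for token_bytes
        num_bytes = token_bytes[ids] * valid             # zero bytes at masked positions
        total_nats += (loss * num_bytes).sum().item()
        total_bytes += num_bytes.sum().item()
    if total_bytes == 0:
        return float('inf')
    return total_nats / (math.log(2) * total_bytes)

class ToyModel:
    """Stand-in model: uniform cross-entropy log(V) at every position."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
    def __call__(self, x, y, loss_reduction='none'):
        return torch.full(y.shape, math.log(self.vocab_size))

vocab_size = 256
token_bytes = torch.ones(vocab_size, dtype=torch.long)  # pretend every token is 1 byte
batches = ((torch.zeros(2, 8, dtype=torch.long), torch.zeros(2, 8, dtype=torch.long))
           for _ in range(4))
bpb = evaluate_bpb(ToyModel(vocab_size), batches, steps=4, token_bytes=token_bytes)
print(round(bpb, 4))  # uniform model over 256 one-byte tokens -> 8.0 bits per byte
```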
### Implementation Details
Handling ignore_index: when target tokens contain negative values (e.g., padding marked with -1):

- Create a valid mask: `valid = y >= 0`
- Replace negative indices with 0 before indexing `token_bytes`
- Zero out the byte counts for invalid positions
- Only sum losses where `num_bytes > 0`
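A minimal sketch of just these masking steps, with tensor values chosen for illustration:

```python
import torch

token_bytes = torch.tensor([1, 3, 2, 4])  # byte length per token ID
y = torch.tensor([[1, 2, -1, 3]])         # -1 marks an ignored position

valid = y >= 0                                    # [[True, True, False, True]]
ids = torch.where(valid, y, torch.zeros_like(y))  # safe indices: [[1, 2, 0, 3]]
num_bytes = token_bytes[ids] * valid              # zero bytes where invalid
print(num_bytes.tolist())  # [[3, 2, 0, 4]]
```

Note that replacing negative indices with 0 only avoids an out-of-bounds gather; the multiplication by `valid` is what actually removes those positions from the byte count.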
Distributed evaluation:

- Each rank accumulates `total_nats` and `total_bytes`
- `all_reduce` sums the totals across all ranks
- The final bpb is computed from the global totals
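The steps above can be sketched as follows; `global_bpb` is a hypothetical helper that falls back to the local totals when no process group is initialized:

```python
import math
import torch
import torch.distributed as dist

def global_bpb(total_nats: float, total_bytes: float) -> float:
    """Reduce per-rank totals, then compute bpb from the global sums."""
    totals = torch.tensor([total_nats, total_bytes], dtype=torch.float64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(totals, op=dist.ReduceOp.SUM)  # sum across all ranks
    total_nats, total_bytes = totals.tolist()
    if total_bytes == 0:
        return float('inf')
    return total_nats / (math.log(2) * total_bytes)

# Single process: 64 one-byte tokens at uniform loss over 256 tokens -> 8.0
print(global_bpb(64 * math.log(256), 64.0))
```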
### Notes
- The model must implement `model(x, y, loss_reduction='none')`, returning per-token losses of shape `(B, T)`
- Works on any device (CUDA, CPU, MPS)
- Handles the edge case where `total_bytes == 0` by returning infinity
- More reliable than mean loss when comparing models with different tokenizers