Usage
Parameters
Evaluation Modes
Comma-separated evaluations to run:
- `core`: CORE metric (accuracy on ICL tasks)
- `bpb`: Bits per byte on the train/val splits
- `sample`: Generate samples from the model
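As a rough illustration of how a comma-separated evaluation flag might be handled (the function and variable names here are my own, not the script's actual API):

```python
# Hypothetical sketch: parse a comma-separated eval spec such as "core,bpb"
# into the list of evaluation modes to run. Names are illustrative only.
VALID_MODES = {"core", "bpb", "sample"}

def parse_eval_modes(spec: str) -> list[str]:
    """Split e.g. "core,bpb" into ["core", "bpb"], rejecting unknown modes."""
    modes = [m.strip() for m in spec.split(",") if m.strip()]
    unknown = [m for m in modes if m not in VALID_MODES]
    if unknown:
        raise ValueError(f"unknown evaluation mode(s): {unknown}")
    return modes
```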
Model Selection
- HuggingFace model path (e.g. `openai-community/gpt2-xl`). Use this to evaluate external models.
- Nanochat model tag identifying the checkpoint directory (e.g. `d24`).
- Model step to load. If not specified, the last checkpoint is loaded.
CORE Evaluation
Maximum examples per CORE task. `-1` = evaluate on all examples.
BPB Evaluation
Per-device batch size for bits-per-byte evaluation.
Number of tokens to evaluate per split for bits-per-byte (default: 40*524288 = 20,971,520 tokens).
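For intuition, bits per byte converts a model's summed negative log-likelihood (in nats) into bits and normalizes by the number of raw bytes in the evaluated text, which makes models with different tokenizers comparable. A minimal sketch (my own helper, not the script's):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed NLL in nats to bits, normalized per byte of text."""
    return total_nll_nats / (math.log(2) * total_bytes)
```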
Runtime
Device type: `cuda`, `cpu`, or `mps`. An empty string enables autodetection.
Examples
Evaluate CORE Only
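A plausible invocation; the script path and flag name are guesses, not verified against the actual argument parser:

```shell
# Hypothetical: run only the CORE evaluation
python -m scripts.base_eval --eval=core
```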
Quick CORE Evaluation (100 examples per task)
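Roughly, with a hypothetical per-task cap (flag names are assumptions):

```shell
# Hypothetical: CORE evaluation limited to 100 examples per task
python -m scripts.base_eval --eval=core --max-per-task=100
```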
Evaluate Bits-Per-Byte Only
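Sketched as (same caveat, the flag name is illustrative):

```shell
# Hypothetical: run only the bits-per-byte evaluation
python -m scripts.base_eval --eval=bpb
```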
Evaluate HuggingFace Model
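Something along these lines, assuming a flag that accepts a HuggingFace model path:

```shell
# Hypothetical: evaluate an external HuggingFace model on CORE
python -m scripts.base_eval --eval=core --hf-path=openai-community/gpt2-xl
```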
Output
Results are written to:
- CORE results: `{base_dir}/base_eval/{model_slug}.csv`
- Console output: summary statistics for all evaluations
- Report: logged to the nanochat report system
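The CSV destination above can be formed with ordinary path joining. A sketch, where `base_dir` and `model_slug` stand for whatever the run configures:

```python
from pathlib import Path

def core_results_path(base_dir: str, model_slug: str) -> Path:
    """Where the CORE results CSV lands, per the layout described above."""
    return Path(base_dir) / "base_eval" / f"{model_slug}.csv"
```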