Usage
Parameters
Required
- Source of the model: sft or rl.

Task Selection
- Task name to evaluate. Use | to separate multiple tasks. If not specified, evaluates all tasks.

Available tasks:
- ARC-Easy (categorical)
- ARC-Challenge (categorical)
- MMLU (categorical)
- GSM8K (generative)
- HumanEval (generative)
- SpellingBee (generative)
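The |-separated task string can be handled as in this sketch (the function and variable names are illustrative, not the script's actual code):

```python
# All tasks known to the evaluator, in the order listed above.
ALL_TASKS = ["ARC-Easy", "ARC-Challenge", "MMLU", "GSM8K", "HumanEval", "SpellingBee"]

def select_tasks(task_arg):
    """Return the list of tasks to evaluate; a missing/empty arg means all tasks."""
    if not task_arg:
        return ALL_TASKS
    return [t.strip() for t in task_arg.split("|")]
```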
Model Selection
- Model tag to load (e.g. d24).
- Step to load. If not specified, loads the last checkpoint.
Generation Parameters
- Floating point precision: float32 or bfloat16.
- Sampling temperature. 0.0 = greedy decoding.
- Maximum number of new tokens to generate.
- Number of samples to generate per problem (for pass@k evaluation).
- Top-k sampling. 0 = disabled.

Batch Size
Batch size for categorical evaluation (logit-based tasks).
Limits
Maximum number of problems to evaluate. If not specified, evaluates all problems.
Runtime
- Device type: cuda, cpu, or mps. Empty string enables autodetection.

Evaluation Types
Categorical Tasks
For multiple-choice tasks (ARC-Easy, ARC-Challenge, MMLU):
- Processes batches of problems in parallel
- Compares logits for answer choices (A, B, C, D)
- No generation required (more efficient)
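The logit comparison can be sketched as follows; this is a minimal illustration, and the model interface and token ids are hypothetical, not the script's actual API:

```python
def pick_answer(logits, choice_token_ids):
    """Pick the answer letter whose token gets the highest logit.

    logits: model logits at the answer position (one float per vocab id)
    choice_token_ids: mapping from answer letter to its vocabulary id
    """
    return max(choice_token_ids, key=lambda letter: logits[choice_token_ids[letter]])
```

Because only a forward pass over the prompt is needed (no sampling loop), many problems can be scored per batch, which is why categorical evaluation is cheaper than generation.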
Generative Tasks
For open-ended tasks (GSM8K, HumanEval, SpellingBee):
- Generates completions for each problem
- Evaluates correctness using task-specific criteria
- Supports pass@k evaluation with multiple samples
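pass@k is commonly computed with the unbiased estimator over n samples of which c are correct; a sketch of that standard formula (the script's exact implementation is not shown here):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from the n generated ones is among the c correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```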
Examples
Evaluate SFT Model on MMLU
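A plausible command for this example; the module path scripts.chat_eval and the long flag names are assumptions, not verified against the script's argument parser:

```bash
python -m scripts.chat_eval --source sft --task MMLU
```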
Evaluate RL Model on GSM8K with Pass@8
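Hypothetically (flag names are assumptions; note that drawing multiple distinct samples per problem requires a nonzero temperature):

```bash
python -m scripts.chat_eval --source rl --task GSM8K --num-samples 8 --temperature 1.0
```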
Quick Evaluation (100 problems)
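Capping the number of problems gives a fast smoke test (hypothetical flag names):

```bash
python -m scripts.chat_eval --source sft --max-problems 100
```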
Evaluate All Tasks
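Leaving the task unspecified evaluates every task (flag names are assumptions):

```bash
python -m scripts.chat_eval --source sft
```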
High Temperature Sampling
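For example, raising the temperature on a generative task (hypothetical flag names):

```bash
python -m scripts.chat_eval --source sft --task GSM8K --temperature 1.0
```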
ChatCORE Metric
When all tasks are evaluated, the script computes the ChatCORE metric, which normalizes each task's accuracy against its chance baseline:
- Categorical tasks (ARC, MMLU): 25% (random guessing)
- Generative tasks (GSM8K, HumanEval, SpellingBee): 0%
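Assuming ChatCORE is the mean of baseline-normalized accuracies (an assumption consistent with the baselines listed above; the script itself is authoritative), the computation could look like:

```python
def chat_core(accuracies, baselines):
    """Average of per-task accuracies, each rescaled so that the
    chance baseline maps to 0 and perfect accuracy maps to 1."""
    normed = [(accuracies[t] - baselines[t]) / (1.0 - baselines[t]) for t in accuracies]
    return sum(normed) / len(normed)
```

With this rescaling, a model at chance level scores 0 regardless of how many multiple-choice tasks are in the mix.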
Output
Results are logged to:
- Console: Real-time progress and final accuracy
- Report: Nanochat report system with all task results
- Wandb: If configured (not in this script, but in training scripts)