Evaluation script for chat models supporting both categorical and generative tasks.

Usage

# Single task
python -m scripts.chat_eval -i sft -a ARC-Easy

# Distributed (8 GPUs)
torchrun --nproc_per_node=8 -m scripts.chat_eval -- -i sft -a ARC-Easy

# Multiple tasks
python -m scripts.chat_eval -i rl -a "MMLU|GSM8K|HumanEval"

# All tasks
python -m scripts.chat_eval -i sft -a all

Parameters

Required

-i, --source
str
required
Source of the model: sft or rl.

Task Selection

-a, --task-name
str
default:"None"
Task name to evaluate. Use | to separate multiple tasks. If not specified, evaluates all tasks. Available tasks:
  • ARC-Easy (categorical)
  • ARC-Challenge (categorical)
  • MMLU (categorical)
  • GSM8K (generative)
  • HumanEval (generative)
  • SpellingBee (generative)

Model Selection

-g, --model-tag
str
default:"None"
Model tag to load (e.g. d24).
-s, --step
int
default:"None"
Step to load. If not specified, loads the last checkpoint.

Generation Parameters

-d, --dtype
str
default:"bfloat16"
Floating point precision: float32 or bfloat16.
-t, --temperature
float
default:"0.0"
Sampling temperature. 0.0 = greedy decoding.
-m, --max-new-tokens
int
default:"512"
Maximum number of new tokens to generate.
-n, --num-samples
int
default:"1"
Number of samples to generate per problem (for pass@k evaluation).
-k, --top-k
int
default:"50"
Top-k sampling. 0 = disabled.

Batch Size

-b, --batch-size
int
default:"8"
Batch size for categorical evaluation (logit-based tasks).

Limits

-x, --max-problems
int
default:"None"
Maximum number of problems to evaluate. If not specified, evaluates all problems.

Runtime

--device-type
str
default:""
Device type: cuda, cpu, or mps. Empty string enables autodetection.

Evaluation Types

Categorical Tasks

For multiple-choice tasks (ARC-Easy, ARC-Challenge, MMLU):
  • Processes batches of problems in parallel
  • Compares logits for answer choices (A, B, C, D)
  • No generation required (more efficient)
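The logit-comparison idea can be sketched as follows (a minimal illustration, not the script's actual API: the function name and the stand-in logit dict are hypothetical):

```python
# Illustrative sketch: for a multiple-choice problem, score each answer
# letter by the logit the model assigned to that letter's token, then
# pick the argmax. No token generation is needed, so a whole batch of
# problems can be scored with a single forward pass.

def pick_choice(letter_logits):
    """Return the answer letter with the highest logit."""
    return max(letter_logits, key=letter_logits.get)

# Stand-in logits for the answer tokens A-D of one problem.
logits = {"A": -1.2, "B": 0.7, "C": -0.3, "D": -2.5}
prediction = pick_choice(logits)  # "B"
```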

Generative Tasks

For open-ended tasks (GSM8K, HumanEval, SpellingBee):
  • Generates completions for each problem
  • Evaluates correctness using task-specific criteria
  • Supports pass@k evaluation with multiple samples
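For reference, the standard unbiased pass@k estimator (Chen et al., 2021) looks like this; whether the script uses exactly this estimator or a simpler empirical rate is not stated here, so treat it as a sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generated samples is correct,
    given that c of the n samples passed."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 of 10 samples correct -> pass@1 is the plain accuracy, 0.5
```

This is why `-n 8` with `-t 1.0` is used for pass@8: multiple samples per problem are needed, and greedy decoding (temperature 0.0) would make all n samples identical.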

Examples

Evaluate SFT Model on MMLU

torchrun --nproc_per_node=8 -m scripts.chat_eval \
  -i sft \
  -a MMLU \
  --model-tag d24

Evaluate RL Model on GSM8K with Pass@8

python -m scripts.chat_eval \
  -i rl \
  -a GSM8K \
  --model-tag d24 \
  -n 8 \
  -t 1.0

Quick Evaluation (100 problems)

python -m scripts.chat_eval \
  -i sft \
  -a HumanEval \
  --max-problems 100

Evaluate All Tasks

torchrun --nproc_per_node=8 -m scripts.chat_eval \
  -i sft \
  --model-tag d24

High Temperature Sampling

python -m scripts.chat_eval \
  -i rl \
  -a GSM8K \
  -t 1.5 \
  -k 100 \
  -n 16

ChatCORE Metric

When all tasks are evaluated, the script computes the ChatCORE metric:
centered_accuracy = (accuracy - baseline) / (1.0 - baseline)
chatcore_metric = mean(centered_accuracy for all tasks)
Baselines:
  • Categorical tasks (ARC, MMLU): 25% (random guessing)
  • Generative tasks (GSM8K, HumanEval, SpellingBee): 0%
The ChatCORE metric ranges from 0 (random performance) to 1 (perfect performance).
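The formulas above can be sketched in Python (the helper name and the example accuracies are illustrative; baselines follow the list above):

```python
def chatcore(task_results):
    """Mean baseline-centered accuracy over all tasks.
    task_results maps task name -> (accuracy, baseline)."""
    centered = [(acc - base) / (1.0 - base)
                for acc, base in task_results.values()]
    return sum(centered) / len(centered)

results = {
    "MMLU": (0.625, 0.25),   # categorical: 25% random-guess baseline
    "GSM8K": (0.40, 0.0),    # generative: 0% baseline
}
# MMLU centers to (0.625 - 0.25) / 0.75 = 0.5; GSM8K stays 0.40
score = chatcore(results)  # 0.45
```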

Output

Results are logged to:
  • Console: Real-time progress and final accuracy
  • Report: Nanochat report system with all task results
  • Wandb: if configured (Wandb logging happens in the training scripts, not in this script)
