Evaluation script for chat models supporting both categorical and generative tasks.

Usage

# Single task
python -m scripts.chat_eval -i sft -a ARC-Easy

# Distributed (8 GPUs)
torchrun --nproc_per_node=8 -m scripts.chat_eval -- -i sft -a ARC-Easy

# Multiple tasks
python -m scripts.chat_eval -i rl -a "MMLU|GSM8K|HumanEval"

# All tasks
python -m scripts.chat_eval -i sft -a all

Parameters

Required

-i, --source
str
required
Source of the model: sft or rl.

Task Selection

-a, --task-name
str
default:"None"
Task name to evaluate. Use | to separate multiple tasks. If not specified, evaluates all tasks. Available tasks:
  • ARC-Easy (categorical)
  • ARC-Challenge (categorical)
  • MMLU (categorical)
  • GSM8K (generative)
  • HumanEval (generative)
  • SpellingBee (generative)

Model Selection

-g, --model-tag
str
default:"None"
Model tag to load (e.g. d24).
-s, --step
int
default:"None"
Step to load. If not specified, loads the last checkpoint.

Generation Parameters

-d, --dtype
str
default:"bfloat16"
Floating point precision: float32 or bfloat16.
-t, --temperature
float
default:"0.0"
Sampling temperature. 0.0 = greedy decoding.
-m, --max-new-tokens
int
default:"512"
Maximum number of new tokens to generate.
-n, --num-samples
int
default:"1"
Number of samples to generate per problem (for pass@k evaluation).
-k, --top-k
int
default:"50"
Top-k sampling. 0 = disabled.

Batch Size

-b, --batch-size
int
default:"8"
Batch size for categorical evaluation (logit-based tasks).

Limits

-x, --max-problems
int
default:"None"
Maximum number of problems to evaluate. If not specified, evaluates all problems.

Runtime

--device-type
str
default:""
Device type: cuda, cpu, or mps. Empty string enables autodetection.

Evaluation Types

Categorical Tasks

For multiple-choice tasks (ARC-Easy, ARC-Challenge, MMLU):
  • Processes batches of problems in parallel
  • Compares logits for answer choices (A, B, C, D)
  • No generation required (more efficient)
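The logit-comparison idea can be sketched as follows (a minimal illustration, not the script's actual API: the function name and the stand-in logit dict are hypothetical):

```python
# Illustrative sketch: for a multiple-choice problem, score each answer
# letter by the logit the model assigned to that letter's token, then
# pick the argmax. No token generation is needed, so a whole batch of
# problems can be scored with a single forward pass.

def pick_choice(letter_logits):
    """Return the answer letter with the highest logit."""
    return max(letter_logits, key=letter_logits.get)

# Stand-in logits for the answer tokens A-D of one problem.
logits = {"A": -1.2, "B": 0.7, "C": -0.3, "D": -2.5}
prediction = pick_choice(logits)  # "B"
```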

Generative Tasks

For open-ended tasks (GSM8K, HumanEval, SpellingBee):
  • Generates completions for each problem
  • Evaluates correctness using task-specific criteria
  • Supports pass@k evaluation with multiple samples
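For reference, the standard unbiased pass@k estimator (Chen et al., 2021) looks like this; whether the script uses exactly this estimator or a simpler empirical rate is not stated here, so treat it as a sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generated samples is correct,
    given that c of the n samples passed."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 of 10 samples correct -> pass@1 is the plain accuracy, 0.5
```

This is why `-n 8` with `-t 1.0` is used for pass@8: multiple samples per problem are needed, and greedy decoding (temperature 0.0) would make all n samples identical.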

Examples

Evaluate SFT Model on MMLU

torchrun --nproc_per_node=8 -m scripts.chat_eval \
  -i sft \
  -a MMLU \
  --model-tag d24

Evaluate RL Model on GSM8K with Pass@8

python -m scripts.chat_eval \
  -i rl \
  -a GSM8K \
  --model-tag d24 \
  -n 8 \
  -t 1.0

Quick Evaluation (100 problems)

python -m scripts.chat_eval \
  -i sft \
  -a HumanEval \
  --max-problems 100

Evaluate All Tasks

torchrun --nproc_per_node=8 -m scripts.chat_eval \
  -i sft \
  --model-tag d24

High Temperature Sampling

python -m scripts.chat_eval \
  -i rl \
  -a GSM8K \
  -t 1.5 \
  -k 100 \
  -n 16

ChatCORE Metric

When all tasks are evaluated, the script computes the ChatCORE metric:
centered_accuracy = (accuracy - baseline) / (1.0 - baseline)
chatcore_metric = mean(centered_accuracy for all tasks)
Baselines:
  • Categorical tasks (ARC, MMLU): 25% (random guessing)
  • Generative tasks (GSM8K, HumanEval, SpellingBee): 0%
The ChatCORE metric ranges from 0 (random performance) to 1 (perfect performance).
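The formulas above can be sketched in Python (the helper name and the example accuracies are illustrative; baselines follow the list above):

```python
def chatcore(task_results):
    """Mean baseline-centered accuracy over all tasks.
    task_results maps task name -> (accuracy, baseline)."""
    centered = [(acc - base) / (1.0 - base)
                for acc, base in task_results.values()]
    return sum(centered) / len(centered)

results = {
    "MMLU": (0.625, 0.25),   # categorical: 25% random-guess baseline
    "GSM8K": (0.40, 0.0),    # generative: 0% baseline
}
# MMLU centers to (0.625 - 0.25) / 0.75 = 0.5; GSM8K stays 0.40
score = chatcore(results)  # 0.45
```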

Output

Results are logged to:
  • Console: Real-time progress and final accuracy
  • Report: Nanochat report system with all task results
  • Wandb: if configured (Wandb logging happens in the training scripts, not in this script)
