Overview
Nanochat supports multiple evaluation tasks covering reasoning, coding, and general conversation. Each task is used to measure specific capabilities of the chat model.
Task Types
Tasks fall into two evaluation categories:
Categorical
Model selects from predefined answer choices. Evaluation checks logits at specific positions without generation.
Tasks: ARC-Easy, ARC-Challenge, MMLU
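The categorical flow can be illustrated with a toy example: only the tokens for the answer letters are scored, and the prediction is the argmax, with no generation. The logit values below are made up for illustration:

```python
# Hypothetical logits assigned to the answer-letter tokens at the answer position
letter_logits = {"A": 1.2, "B": 3.4, "C": 0.7, "D": -0.5}

# The prediction is simply the choice with the highest logit; nothing is generated
predicted = max(letter_logits, key=letter_logits.get)
print(predicted)  # "B"
```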
Generative
Model generates free-form responses. Evaluation checks if the generation satisfies success criteria.
Tasks: GSM8K, HumanEval, SpellingBee, SmolTalk
Available Tasks
ARC (AI2 Reasoning Challenge)
Type: Categorical
Source: allenai/ai2_arc
Subsets: ARC-Easy, ARC-Challenge
Random Baseline: 25% (4-choice multiple choice)
Description
Science questions from standardized tests. ARC-Easy contains simpler questions, while ARC-Challenge focuses on questions that require deeper reasoning.
Implementation
From tasks/arc.py:ARC:
from tasks.arc import ARC
# Load the task
task = ARC(subset="ARC-Easy", split="test")
# Get an example
conversation = task[0]
# {
# "messages": [
# {"role": "user", "content": "Question: ...\n\nA) ...\nB) ...\nC) ...\nD) ..."},
# {"role": "assistant", "content": "A"}
# ],
# "letters": ["A", "B", "C", "D"]
# }
# Evaluate
is_correct = task.evaluate(conversation, "A")
Evaluation Method
def evaluate(self, conversation, assistant_response):
    # The ground truth is the letter stored in the final assistant message
    assistant_message = conversation['messages'][-1]['content']
    return assistant_response == assistant_message
MMLU (Massive Multitask Language Understanding)
Type: Categorical
Source: cais/mmlu
Subsets: all (57 subjects), auxiliary_train
Random Baseline: 25% (4-choice multiple choice)
Description
Covers 57 subjects including STEM, humanities, social sciences, and more. Tests world knowledge and reasoning across diverse domains.
Subjects include: abstract_algebra, anatomy, astronomy, business_ethics, clinical_knowledge, college_biology, computer_security, medical_genetics, professional_law, world_religions, and 47 more.
Implementation
From tasks/mmlu.py:MMLU:
from tasks.mmlu import MMLU
# Load all subjects
task = MMLU(subset="all", split="test")
# Get an example
conversation = task[0]
# {
# "messages": [...],
# "subject": "college_biology",
# "letters": ("A", "B", "C", "D")
# }
Evaluation Method
Identical to ARC: checks whether the predicted letter matches the ground-truth letter.
GSM8K (Grade School Math)
Type: Generative
Source: openai/gsm8k
Random Baseline: 0% (open-ended)
Description
Grade-school level math word problems. Solutions use tool calls (calculator) embedded in reasoning steps.
Question:
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Answer:
Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
#### 10
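The `<<expression=result>>` spans are the embedded calculator tool calls. As a sketch (this regex is an assumption, not nanochat's implementation), they can be parsed and verified like this:

```python
import re

# Calculator annotations have the form <<expression=result>>
CALC_RE = re.compile(r"<<([^=<>]+)=([^<>]+)>>")

line = "Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute."
for expr, result in CALC_RE.findall(line):
    # eval() is acceptable here: the expressions are trusted arithmetic strings
    assert abs(eval(expr) - float(result)) < 1e-9
```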
Implementation
From tasks/gsm8k.py:GSM8K:
from tasks.gsm8k import GSM8K
task = GSM8K(subset="main", split="test")
# Conversations include tool calls
conversation = task[0]
# {
# "messages": [
# {"role": "user", "content": "Weng earns $12..."},
# {"role": "assistant", "content": [
# {"type": "text", "text": "Weng earns 12/60 = $"},
# {"type": "python", "text": "12/60"},
# {"type": "python_output", "text": "0.2"},
# {"type": "text", "text": "0.2 per minute...\n#### 10"}
# ]}
# ]
# }
Evaluation Method
def evaluate(self, conversation, assistant_response):
    # The ground truth lives in the final assistant message; its last
    # text part ends with "#### <answer>"
    ground_truth = conversation['messages'][-1]['content']
    if not isinstance(ground_truth, str):  # multi-part message
        ground_truth = ground_truth[-1]['text']
    # Extract the ground truth after the #### marker
    ref_num = extract_answer(ground_truth)  # e.g., "10"
    # Extract the prediction after the #### marker
    pred_num = extract_answer(assistant_response)
    # Compare
    return int(pred_num == ref_num)
Answer extraction regex: `#### (\-?[0-9\.\,]+)`
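A minimal sketch of what an `extract_answer` helper built on that regex could look like (the comma normalization is an assumption, not necessarily what nanochat does):

```python
import re

ANSWER_RE = re.compile(r"#### (\-?[0-9\.\,]+)")

def extract_answer(text):
    """Return the number after the #### marker, or None if absent."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    # Strip thousands separators so "1,000" and "1000" compare equal
    return match.group(1).replace(",", "")

print(extract_answer("0.2 x 50 = 10.\n#### 10"))  # "10"
```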
HumanEval
Type: Generative
Source: openai/openai_humaneval
Random Baseline: 0% (open-ended)
Description
Python programming problems. Model completes function implementations and is evaluated by executing test cases.
Implementation
From tasks/humaneval.py:HumanEval:
from tasks.humaneval import HumanEval
task = HumanEval()
conversation = task[0]
# {
# "messages": [
# {"role": "user", "content": "def add(a, b):\n ..."},
# {"role": "assistant", "content": "def add(a, b):\n return a + b"}
# ],
# "entry_point": "add",
# "test": "def check(candidate): ..."
# }
Evaluation Method
def evaluate(self, conversation, completion):
    # Extract imports from the prompt
    imports = extract_imports(conversation['messages'][0]['content'])
    # Extract code from the completion (handles markdown blocks)
    code = extract_program(completion)
    # Assemble the full program: imports + solution + tests + check call
    program = (
        imports + "\n\n" +
        code + "\n\n" +
        conversation['test'] + "\n" +
        f"check({conversation['entry_point']})"
    )
    # Execute and check whether all test cases pass
    result = execute_code(program)
    return result.success
Handles completions in any of three forms:
- Code wrapped in a ```python ... ``` block
- Code wrapped in a bare ``` ... ``` block
- Plain code without markdown blocks
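A hedged sketch of what the fence handling in `extract_program` might look like (the real implementation may differ):

```python
import re

def extract_program(completion):
    """Return the code portion of a completion, stripping an optional markdown fence."""
    # Match ```python ... ``` or a bare ``` ... ``` block
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    if match:
        return match.group(1)
    # No fence found: treat the whole completion as plain code
    return completion
```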
SpellingBee
Type: Generative
Source: Synthetic (generated from dwyl/english-words)
Random Baseline: 0% (open-ended)
Description
Counting letter occurrences in words. Teaches models to spell words correctly and verify with Python.
Example
User: How many r are in strawberry?
Assistant:
We are asked to find the number of 'r's in the word 'strawberry'. Let me try a manual approach first.
First spell the word out:
strawberry:s,t,r,a,w,b,e,r,r,y
Then count the occurrences of 'r':
1:s
2:t
3:r hit! count=1
4:a
5:w
6:b
7:e
8:r hit! count=2
9:r hit! count=3
10:y
This gives us 3.
Let me double check this using Python:
<<'strawberry'.count('r')=3>>
Python gives us 3.
My final answer is:
#### 3
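The Python verification in the transcript boils down to a single string count; the manual tally can be reproduced in a few lines:

```python
word, letter = "strawberry", "r"

# Manual tally, mirroring the step-by-step count in the transcript
manual = sum(1 for ch in word if ch == letter)

# Python verification, as in the <<'strawberry'.count('r')=3>> tool call
assert manual == word.count(letter) == 3
```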
Implementation
From tasks/spellingbee.py:SpellingBee:
from tasks.spellingbee import SpellingBee
task = SpellingBee(size=256, split="test")
# Examples are procedurally generated
conversation = task[0]
# User messages have variations (30+ templates, multilingual)
# Assistant responses include manual counting + Python verification
User Message Templates
Highly varied for robustness (45+ variations):
- “How many r are in strawberry”
- “Count the number of r in strawberry”
- “¿Cuántas r hay en strawberry?” (Spanish)
- “strawberryに rは何個ありますか” (Japanese)
- With/without quotes, question marks, capitalization
Evaluation Method
Identical to GSM8K: extracts the answer after the #### marker.
SmolTalk
Type: Training only (not evaluated)
Source: HuggingFaceTB/smol-smoltalk
Splits: train (460K), test (24K)
Description
General conversational dataset for teaching chat behavior. Used during supervised fine-tuning, not for evaluation.
Implementation
From tasks/smoltalk.py:SmolTalk:
from tasks.smoltalk import SmolTalk
task = SmolTalk(split="train")
conversation = task[0]
# {
# "messages": [
# {"role": "system", "content": "..."}, # Optional
# {"role": "user", "content": "..."},
# {"role": "assistant", "content": "..."},
# {"role": "user", "content": "..."},
# {"role": "assistant", "content": "..."},
# # ... alternating user/assistant
# ]
# }
- Optional system message at the beginning
- At least 2 messages after system (user + assistant minimum)
- Strict alternation: user, assistant, user, assistant, …
- All content must be strings (no multi-part messages)
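These structural constraints can be expressed as a small validator. This is an illustrative sketch, not code from tasks/smoltalk.py:

```python
def is_valid_conversation(messages):
    """Check the SmolTalk structural constraints described above."""
    # An optional system message comes first; skip it for the alternation check
    if messages and messages[0]["role"] == "system":
        messages = messages[1:]
    # Need at least one user/assistant pair
    if len(messages) < 2:
        return False
    for i, msg in enumerate(messages):
        # Strict alternation: user on even indices, assistant on odd
        if msg["role"] != ("user" if i % 2 == 0 else "assistant"):
            return False
        # All content must be plain strings (no multi-part messages)
        if not isinstance(msg["content"], str):
            return False
    return True
```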
Running Evaluations
Single Task
python -m scripts.chat_eval \
--source sft \
--task-name ARC-Easy
Multiple Tasks
python -m scripts.chat_eval \
--source sft \
--task-name "ARC-Easy|MMLU|GSM8K"
All Tasks
python -m scripts.chat_eval --source sft
Distributed Evaluation (8 GPUs)
torchrun --nproc_per_node=8 -m scripts.chat_eval \
--source sft \
--task-name "ARC-Easy|ARC-Challenge|MMLU|GSM8K|HumanEval"
Quick Evaluation
python -m scripts.chat_eval \
--source sft \
--task-name ARC-Easy \
--max-problems 100
Evaluation Parameters
For Categorical Tasks
- Number of problems to process in parallel

For Generative Tasks
- Number of completions to generate per problem (pass@k)
- Maximum tokens to generate per completion
- Sampling temperature (0 = greedy)
ChatCORE Metric
When all 6 evaluation tasks are run, a ChatCORE metric is computed (similar to CORE for base models):
# Center each task's accuracy against its random baseline
centered_mean = 0.0
for task_name, acc in results.items():
    baseline_acc = baseline_accuracies[task_name]
    centered_acc = (acc - baseline_acc) / (1.0 - baseline_acc)
    centered_mean += centered_acc
chatcore_metric = centered_mean / len(results)
ChatCORE ranges from 0.0 (random-chance performance) to 1.0 (perfect).
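A worked example with made-up accuracies (the numbers are illustrative, not real results):

```python
# Hypothetical per-task accuracies and their random baselines
results = {"ARC-Easy": 0.625, "MMLU": 0.40, "GSM8K": 0.05}
baselines = {"ARC-Easy": 0.25, "MMLU": 0.25, "GSM8K": 0.0}

# Center each accuracy against its baseline, then average
centered = [(results[t] - baselines[t]) / (1.0 - baselines[t]) for t in results]
chatcore = sum(centered) / len(centered)
# ARC-Easy: 0.375/0.75 = 0.50; MMLU: 0.15/0.75 = 0.20; GSM8K: 0.05/1.0 = 0.05
print(round(chatcore, 2))  # 0.25
```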
Task Common Interface
All tasks inherit from tasks.common.Task:
class Task:
    @property
    def eval_type(self):
        """Return 'categorical' or 'generative'"""
        raise NotImplementedError

    def num_examples(self):
        """Total number of examples"""
        raise NotImplementedError

    def get_example(self, index):
        """Get example at index as a conversation dict"""
        raise NotImplementedError

    def evaluate(self, conversation, assistant_response):
        """Return True/False or 0/1 for correctness"""
        raise NotImplementedError

    def __len__(self):
        return self.num_examples()

    def __getitem__(self, index):
        return self.get_example(index)
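A toy subclass shows how the interface fits together. The ParityTask below is a made-up example, and the Task stub is a minimal stand-in for tasks.common.Task so the snippet is self-contained:

```python
class Task:
    # Minimal stand-in for tasks.common.Task
    def __len__(self):
        return self.num_examples()
    def __getitem__(self, index):
        return self.get_example(index)

class ParityTask(Task):
    """Toy generative task: is a number odd or even?"""
    def __init__(self):
        self.data = [(3, "odd"), (4, "even")]

    @property
    def eval_type(self):
        return "generative"

    def num_examples(self):
        return len(self.data)

    def get_example(self, index):
        n, answer = self.data[index]
        return {"messages": [
            {"role": "user", "content": f"Is {n} odd or even?"},
            {"role": "assistant", "content": answer},
        ]}

    def evaluate(self, conversation, assistant_response):
        # Correct iff the response matches the reference assistant message
        return assistant_response == conversation["messages"][-1]["content"]

task = ParityTask()
print(len(task), task.evaluate(task[0], "odd"))  # 2 True
```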
Reference
- Base evaluation:
scripts/base_eval.py
- Chat evaluation:
scripts/chat_eval.py
- Task implementations:
tasks/*.py