Task Base Classes
Task
The base class for all tasks provides a lightweight slicing interface over datasets.
- eval_type: Returns either "categorical" for multiple choice tasks or "generative" for open-ended tasks
- start, stop, step: Allow logical slicing over the dataset
- num_examples(): Returns the total number of examples in the dataset
- get_example(index): Returns a conversation dict with a messages array
- evaluate(conversation, assistant_response): Returns an evaluation score (typically 0 or 1)
- __len__(): Returns the effective length considering slicing parameters
- __getitem__(index): Array-style access to conversations
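The slicing interface above can be sketched as follows. This is a minimal illustration of the described contract, not the library's actual implementation; the subclass hooks (`num_examples`, `get_example`) are the ones listed above.

```python
# Minimal sketch of the Task slicing interface described above.
# Subclasses implement num_examples() and get_example(index); the base
# class maps logical indices through start/stop/step.
class Task:
    def __init__(self, start=0, stop=None, step=1):
        self.start, self.stop, self.step = start, stop, step

    def num_examples(self):
        raise NotImplementedError

    def get_example(self, index):
        raise NotImplementedError

    def __len__(self):
        # Effective length after applying the logical slice
        stop = self.num_examples() if self.stop is None else self.stop
        return max(0, (stop - self.start + self.step - 1) // self.step)

    def __getitem__(self, index):
        # Array-style access: translate logical index to physical index
        return self.get_example(self.start + index * self.step)
```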
TaskMixture
Combines multiple tasks with deterministic shuffling for SFT training.

TaskSequence

Sequentially concatenates tasks for curriculum-based training.

Evaluation Tasks
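Deterministic shuffling means the same mixture order is produced on every run. A sketch of one way to achieve this with a fixed seed (the constructor signature and `seed` parameter are illustrative assumptions, not the library's API):

```python
import random

# Hypothetical sketch of a TaskMixture: flatten (task, example) pairs
# and shuffle them with a fixed seed so the order is reproducible.
class TaskMixture:
    def __init__(self, tasks, seed=42):
        self.tasks = tasks
        self.order = [(ti, ei) for ti, t in enumerate(tasks)
                      for ei in range(len(t))]
        random.Random(seed).shuffle(self.order)  # deterministic shuffle

    def __len__(self):
        return len(self.order)

    def __getitem__(self, index):
        ti, ei = self.order[index]
        return self.tasks[ti][ei]
```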
ARC
Multiple choice science questions from Allen AI.
- subset: "ARC-Easy" or "ARC-Challenge"
- split: "train", "validation", or "test"

Type: categorical
Dataset: allenai/ai2_arc
MMLU
Massive Multitask Language Understanding: multiple choice questions across 57 subjects.
- subset: "all" or "auxiliary_train"
- split: "train", "validation", "dev", or "test"

Type: categorical
Subjects: 57 topics including abstract_algebra, anatomy, astronomy, computer_science, mathematics, physics, and more
Dataset: cais/mmlu
GSM8K
8,000 grade school math problems with step-by-step solutions using tool calls.
- subset: "main" or "socratic"
- split: "train" or "test"

Type: generative
Format: Solutions use <<expression=result>> syntax for calculator tool calls. Final answers are marked with #### number.
Example solution in this format: "She sold 48/2 = <<48/2=24>>24 clips in May, so 48 + 24 = <<48+24=72>>72 in total. #### 72"
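The #### marker makes grading a simple string comparison. A minimal sketch of that idea (`extract_answer` and `evaluate` are illustrative names, not the library's API):

```python
import re

# Hypothetical GSM8K-style grader: pull out the number after "####"
# in both the reference solution and the model response, and compare.
def extract_answer(text):
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    return match.group(1).replace(",", "") if match else None

def evaluate(reference, response):
    ref = extract_answer(reference)
    return 1 if ref is not None and extract_answer(response) == ref else 0
```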
HumanEval
Python coding benchmark (the name is a misnomer: it has nothing to do with humans).

Type: generative
Format: Each example contains a function signature with docstring (prompt), the canonical solution, and test cases. Evaluation executes the generated code against test cases.
Dataset: openai/openai_humaneval
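Execution-based grading can be sketched as follows: concatenate the function signature, the generated completion, and the test snippet, then run it and count the sample correct if no assertion fails. This is an illustrative simplification (a real harness sandboxes and times out the execution); the function name is an assumption.

```python
# Hypothetical sketch of HumanEval-style grading: run the prompt plus the
# generated completion against the benchmark's check() tests.
def evaluate_humaneval(prompt, completion, test, entry_point):
    program = prompt + completion + "\n" + test + f"\ncheck({entry_point})\n"
    scope = {}
    try:
        exec(program, scope)  # real harnesses sandbox and time-limit this
        return 1
    except Exception:
        return 0
```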
Fine-tuning Tasks
SmolTalk
General conversational dataset from HuggingFace.
- split: "train" or "test"
SpellingBee
Teaches models to spell words and count letter occurrences.
- size: Number of examples to generate
- split: "train" or "test"

Type: generative
Purpose: Smaller models struggle with character-level understanding since they work with tokens. This task helps by:
- Practicing word spelling (mapping tokens to character sequences)
- Counting letter occurrences using both manual and Python verification
Example prompts:
- "How many r are in strawberry?"
- "Count the number of e in the word hello"
Prompts include Spanish, Chinese, Korean, French, German, and Japanese variations, and final counts are marked in the #### 3 format.
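A synthetic example of this kind can be generated as below. This is a sketch of the idea (spell the word out, count the letter, end with the #### marker); the function name and output wording are assumptions, not the task's actual templates.

```python
# Hypothetical SpellingBee-style example generator: the assistant spells
# the word character by character, counts the target letter, and ends
# with the #### answer marker.
def make_counting_example(word, letter):
    spelled = "-".join(word)        # e.g. "strawberry" -> "s-t-r-a-w-..."
    count = word.count(letter)
    user = f"How many {letter} are in {word}?"
    assistant = (
        f"Spelling it out: {spelled}. "
        f"Counting the letter {letter}: {count}. #### {count}"
    )
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}
```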
SimpleSpelling
Condensed version focusing only on spelling practice.

CustomJSON
Load custom conversations from JSONL files. Each conversation must satisfy:
- At least 2 messages per conversation
- Messages must alternate: user, assistant, user, assistant, ...
- Each message needs role and content fields
- Content must be a string
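The rules above can be checked with a small validator. A sketch, assuming each JSONL line holds a dict with a messages array (the exact line schema is an assumption):

```python
import json

# Hypothetical validator for the CustomJSON rules listed above.
def validate_conversation(conv):
    messages = conv["messages"]
    assert len(messages) >= 2, "need at least 2 messages"
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        assert msg["role"] == expected, "roles must alternate user/assistant"
        assert isinstance(msg["content"], str), "content must be a string"

# Example JSONL line that passes validation
line = ('{"messages": [{"role": "user", "content": "hi"}, '
        '{"role": "assistant", "content": "hello!"}]}')
validate_conversation(json.loads(line))
```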
Helper Functions
render_mc
Standard format for multiple choice questions:
- Letter comes AFTER the choice for better token binding in smaller models
- No whitespace before the letter ("=A" not "= A") to match tokenization of assistant responses
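The two conventions above (letter after the choice, no space before it) can be sketched like this. The rendering details beyond those two rules are assumptions, not the library's exact output:

```python
# Hypothetical sketch of render_mc: each choice is followed by its
# letter with no whitespace before it ("blue=B", not "blue= B").
def render_mc(question, letters, choices):
    lines = [question]
    lines += [f"{choice}={letter}" for letter, choice in zip(letters, choices)]
    return "\n".join(lines)

print(render_mc("What color is the sky?", "ABC", ["red", "blue", "green"]))
```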