Task Base Classes
Task
The base class for all tasks provides a lightweight slicing interface over datasets.
- eval_type: Returns either "categorical" for multiple choice tasks or "generative" for open-ended tasks
- start, stop, step: Allow logical slicing over the dataset
- num_examples(): Returns the total number of examples in the dataset
- get_example(index): Returns a conversation dict with a messages array
- evaluate(conversation, assistant_response): Returns an evaluation score (typically 0 or 1)
- __len__(): Returns the effective length considering slicing parameters
- __getitem__(index): Array-style access to conversations
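The slicing interface above can be sketched as follows. This is a minimal illustration of the described contract, not the library's actual implementation; the subclass hooks (`num_examples`, `get_example`) are the ones listed above.

```python
# Minimal sketch of the Task slicing interface described above.
# Subclasses implement num_examples() and get_example(index); the base
# class maps logical indices through start/stop/step.
class Task:
    def __init__(self, start=0, stop=None, step=1):
        self.start, self.stop, self.step = start, stop, step

    def num_examples(self):
        raise NotImplementedError

    def get_example(self, index):
        raise NotImplementedError

    def __len__(self):
        # Effective length after applying the logical slice
        stop = self.num_examples() if self.stop is None else self.stop
        return max(0, (stop - self.start + self.step - 1) // self.step)

    def __getitem__(self, index):
        # Array-style access: translate logical index to physical index
        return self.get_example(self.start + index * self.step)
```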
TaskMixture
Combines multiple tasks with deterministic shuffling for SFT training.

TaskSequence

Sequentially concatenates tasks for curriculum-based training.

Evaluation Tasks
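Deterministic shuffling means the same mixture order is produced on every run. A sketch of one way to achieve this with a fixed seed (the constructor signature and `seed` parameter are illustrative assumptions, not the library's API):

```python
import random

# Hypothetical sketch of a TaskMixture: flatten (task, example) pairs
# and shuffle them with a fixed seed so the order is reproducible.
class TaskMixture:
    def __init__(self, tasks, seed=42):
        self.tasks = tasks
        self.order = [(ti, ei) for ti, t in enumerate(tasks)
                      for ei in range(len(t))]
        random.Random(seed).shuffle(self.order)  # deterministic shuffle

    def __len__(self):
        return len(self.order)

    def __getitem__(self, index):
        ti, ei = self.order[index]
        return self.tasks[ti][ei]
```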
ARC
Multiple choice science questions from Allen AI.
- subset: "ARC-Easy" or "ARC-Challenge"
- split: "train", "validation", or "test"

Type: categorical
Dataset: allenai/ai2_arc
MMLU
Massive Multitask Language Understanding: multiple choice questions across 57 subjects.
- subset: "all" or "auxiliary_train"
- split: "train", "validation", "dev", or "test"

Type: categorical
Subjects: 57 topics including abstract_algebra, anatomy, astronomy, computer_science, mathematics, physics, and more
Dataset: cais/mmlu
GSM8K
8,000 grade school math problems with step-by-step solutions using tool calls.
- subset: "main" or "socratic"
- split: "train" or "test"

Type: generative
Format: Solutions use <<expression=result>> syntax for calculator tool calls. Final answers are marked with #### number.
Example solution in this format: "She sold 48/2 = <<48/2=24>>24 clips in May, so 48 + 24 = <<48+24=72>>72 in total. #### 72"
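The #### marker makes grading a simple string comparison. A minimal sketch of that idea (`extract_answer` and `evaluate` are illustrative names, not the library's API):

```python
import re

# Hypothetical GSM8K-style grader: pull out the number after "####"
# in both the reference solution and the model response, and compare.
def extract_answer(text):
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    return match.group(1).replace(",", "") if match else None

def evaluate(reference, response):
    ref = extract_answer(reference)
    return 1 if ref is not None and extract_answer(response) == ref else 0
```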
HumanEval
Python coding benchmark (the name is a misnomer: it has nothing to do with humans).

Type: generative
Format: Each example contains a function signature with docstring (prompt), the canonical solution, and test cases. Evaluation executes the generated code against test cases.
Dataset: openai/openai_humaneval
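Execution-based grading can be sketched as follows: concatenate the function signature, the generated completion, and the test snippet, then run it and count the sample correct if no assertion fails. This is an illustrative simplification (a real harness sandboxes and times out the execution); the function name is an assumption.

```python
# Hypothetical sketch of HumanEval-style grading: run the prompt plus the
# generated completion against the benchmark's check() tests.
def evaluate_humaneval(prompt, completion, test, entry_point):
    program = prompt + completion + "\n" + test + f"\ncheck({entry_point})\n"
    scope = {}
    try:
        exec(program, scope)  # real harnesses sandbox and time-limit this
        return 1
    except Exception:
        return 0
```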
Fine-tuning Tasks
SmolTalk
General conversational dataset from HuggingFace.
- split: "train" or "test"
SpellingBee
Teaches models to spell words and count letter occurrences.
- size: Number of examples to generate
- split: "train" or "test"

Type: generative
Purpose: Smaller models struggle with character-level understanding since they work with tokens. This task helps by:
- Practicing word spelling (mapping tokens to character sequences)
- Counting letter occurrences using both manual and Python verification
Example prompts:
- "How many r are in strawberry?"
- "Count the number of e in the word hello"
Prompts include Spanish, Chinese, Korean, French, German, and Japanese variations, and final counts are marked in the #### 3 format.
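A synthetic example of this kind can be generated as below. This is a sketch of the idea (spell the word out, count the letter, end with the #### marker); the function name and output wording are assumptions, not the task's actual templates.

```python
# Hypothetical SpellingBee-style example generator: the assistant spells
# the word character by character, counts the target letter, and ends
# with the #### answer marker.
def make_counting_example(word, letter):
    spelled = "-".join(word)        # e.g. "strawberry" -> "s-t-r-a-w-..."
    count = word.count(letter)
    user = f"How many {letter} are in {word}?"
    assistant = (
        f"Spelling it out: {spelled}. "
        f"Counting the letter {letter}: {count}. #### {count}"
    )
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}
```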
SimpleSpelling
Condensed version focusing only on spelling practice.

CustomJSON
Load custom conversations from JSONL files. Each conversation must satisfy:
- At least 2 messages per conversation
- Messages must alternate: user, assistant, user, assistant, ...
- Each message needs role and content fields
- Content must be a string
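The rules above can be checked with a small validator. A sketch, assuming each JSONL line holds a dict with a messages array (the exact line schema is an assumption):

```python
import json

# Hypothetical validator for the CustomJSON rules listed above.
def validate_conversation(conv):
    messages = conv["messages"]
    assert len(messages) >= 2, "need at least 2 messages"
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        assert msg["role"] == expected, "roles must alternate user/assistant"
        assert isinstance(msg["content"], str), "content must be a string"

# Example JSONL line that passes validation
line = ('{"messages": [{"role": "user", "content": "hi"}, '
        '{"role": "assistant", "content": "hello!"}]}')
validate_conversation(json.loads(line))
```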
Helper Functions
render_mc
Standard format for multiple choice questions:
- Letter comes AFTER the choice for better token binding in smaller models
- No whitespace before the letter ("=A" not "= A") to match tokenization of assistant responses
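The two conventions above (letter after the choice, no space before it) can be sketched like this. The rendering details beyond those two rules are assumptions, not the library's exact output:

```python
# Hypothetical sketch of render_mc: each choice is followed by its
# letter with no whitespace before it ("blue=B", not "blue= B").
def render_mc(question, letters, choices):
    lines = [question]
    lines += [f"{choice}={letter}" for letter, choice in zip(letters, choices)]
    return "\n".join(lines)

print(render_mc("What color is the sky?", "ABC", ["red", "blue", "green"]))
```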