
Data Utilities

Utility functions for loading and formatting common benchmark datasets.
These utilities are designed for example datasets and quick prototyping, not core functionality. For production use, load datasets directly with the Hugging Face datasets library.

Overview

The verifiers.utils.data_utils module provides:
  • Pre-configured loaders for common benchmarks (GSM8K, MATH, GPQA, etc.)
  • Dataset formatting helpers
  • Answer extraction utilities
  • System prompts for math tasks

Dataset Loaders

load_example_dataset

def load_example_dataset(
    name: str = "gsm8k",
    split: str | None = None,
    n: int | None = None,
    seed: int = 0
) -> Dataset
Load a preprocessed benchmark dataset.
Parameters:
  • name (str, default "gsm8k"): Dataset name. Supported: "aime2024", "aime2025", "amc2023", "gpqa_diamond", "gpqa_main", "gsm8k", "math", "math500", "mmlu", "mmlu_pro", "openbookqa", "openrs", "openrs_easy", "openrs_hard", "prime_code".
  • split (str | None, default None): Dataset split. If None, uses the default split for that dataset (usually "test" or "train").
  • n (int | None, default None): Number of examples to load. If None, loads all examples.
  • seed (int, default 0): Random seed for shuffling when n is specified.
Returns: HuggingFace Dataset with question and answer columns. Example:
from verifiers.utils.data_utils import load_example_dataset

# Load 100 GSM8K examples
dataset = load_example_dataset("gsm8k", n=100)

# Load all MATH problems
math_dataset = load_example_dataset("math", split="train")

# Load GPQA diamond
gpqa = load_example_dataset("gpqa_diamond")

Formatting Functions

format_dataset

def format_dataset(
    dataset: Dataset,
    system_prompt: str | None = None,
    few_shot: Messages | None = None,
    question_key: str = "question",
    answer_key: str = "answer",
    map_kwargs: dict = {},
) -> Dataset
Add example_id and prompt columns to a dataset.
Parameters:
  • dataset (Dataset, required): Input dataset to format.
  • system_prompt (str | None, default None): System prompt to prepend to all prompts.
  • few_shot (Messages | None, default None): Few-shot examples to include before each question.
  • question_key (str, default "question"): Column name containing questions.
  • answer_key (str, default "answer"): Column name containing answers.
  • map_kwargs (dict, default {}): Additional arguments passed to dataset.map().
Returns: Dataset with example_id and prompt columns. Example:
from verifiers.utils.data_utils import format_dataset, BOXED_SYSTEM_PROMPT
from datasets import load_dataset

# Load raw dataset
raw_dataset = load_dataset("gsm8k", "main", split="test")

# Format with system prompt
formatted = format_dataset(
    raw_dataset,
    system_prompt=BOXED_SYSTEM_PROMPT,
    question_key="question",
    answer_key="answer",
)

# Now has 'prompt' column with messages
print(formatted[0]["prompt"])
# [
#   {"role": "system", "content": "Please reason step by step..."},
#   {"role": "user", "content": "What is 2+2?"}
# ]
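When few_shot is supplied, the messages are inserted between the system prompt and each question. A minimal sketch of the resulting prompt structure, with made-up example messages (the real helper builds this list inside dataset.map()):

```python
# Sketch of the prompt structure produced when few_shot is supplied.
# The example messages here are invented for illustration.
system_prompt = "Please reason step by step, and put your final answer within \\boxed{}."
few_shot = [
    {"role": "user", "content": "What is 3+3?"},
    {"role": "assistant", "content": "3 + 3 = \\boxed{6}"},
]
question = "What is 2+2?"

# System prompt first, then few-shot turns, then the actual question.
prompt = (
    [{"role": "system", "content": system_prompt}]
    + few_shot
    + [{"role": "user", "content": question}]
)
print([m["role"] for m in prompt])  # ['system', 'user', 'assistant', 'user']
```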

Answer Extraction

extract_boxed_answer

def extract_boxed_answer(text: str) -> str
Extract content from LaTeX \boxed{} commands. Finds the last occurrence of \boxed{...} in the text and returns the content between matching braces. If no boxed answer is found or braces don’t match, returns the original text.
Parameters:
  • text (str, required): Text containing a LaTeX boxed answer (e.g., "\boxed{42}").
Returns: str - Extracted answer content, or original text if no valid boxed answer found. Example:
from verifiers.utils.data_utils import extract_boxed_answer

# Simple answer
text = "The answer is \\boxed{42}"
result = extract_boxed_answer(text)  # "42"

# Nested braces
text = "\\boxed{x = \\frac{1}{2}}"
result = extract_boxed_answer(text)  # "x = \\frac{1}{2}"

# Multiple boxed answers - extracts last one
text = "First \\boxed{A}, then \\boxed{B}"
result = extract_boxed_answer(text)  # "B"

# No boxed answer
text = "Just plain text"
result = extract_boxed_answer(text)  # "Just plain text"

extract_hash_answer

def extract_hash_answer(text: str) -> str
Extract answer after #### delimiter (GSM8K format). Returns the text after the first #### marker, stripped of leading/trailing whitespace. If no delimiter is found, returns the original text.
Parameters:
  • text (str, required): Text containing a hash-delimited answer (e.g., "Solution here\n#### 42").
Returns: str - Answer after #### delimiter, or original text if no delimiter found. Example:
from verifiers.utils.data_utils import extract_hash_answer

# Standard GSM8K format
text = "Step 1: Add them.\nStep 2: Get result.\n#### 42"
result = extract_hash_answer(text)  # "42"

# With spaces
text = "Solution goes here #### 100"
result = extract_hash_answer(text)  # "100"

# No delimiter
text = "Just an answer"
result = extract_hash_answer(text)  # "Just an answer"

strip_non_numeric

def strip_non_numeric(text: str) -> str
Remove all non-numeric characters except periods. Example:
from verifiers.utils.data_utils import strip_non_numeric

text = "The answer is $42.5"
result = strip_non_numeric(text)  # "42.5"

System Prompts

BOXED_SYSTEM_PROMPT

BOXED_SYSTEM_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."
Standard prompt for math problems requiring boxed answers.

THINK_BOXED_SYSTEM_PROMPT

THINK_BOXED_SYSTEM_PROMPT = "Think step-by-step inside <think>...</think> tags. Then, give your final answer inside \\boxed{}."
Prompt encouraging explicit reasoning in XML tags. Example:
import verifiers as vf
from verifiers.utils.data_utils import (
    load_example_dataset,
    format_dataset,
    BOXED_SYSTEM_PROMPT,
)

def load_environment():
    # Load and format dataset
    dataset = load_example_dataset("math", n=100)
    dataset = format_dataset(
        dataset,
        system_prompt=BOXED_SYSTEM_PROMPT,
    )
    
    def correct(answer: str, completion: str, **kwargs) -> float:
        from verifiers.utils.data_utils import extract_boxed_answer
        extracted = extract_boxed_answer(completion)
        return 1.0 if extracted == answer else 0.0
    
    return vf.SingleTurnEnv(
        dataset=dataset,
        rubric=vf.Rubric(funcs=[correct]),
    )

Supported Datasets

aime2024
AIME 2024 math competition (15 problems)
aime2025
AIME 2025 math competition (30 problems, AIME I + II)
amc2023
AMC 2023 math competition
gpqa_diamond
GPQA Diamond subset (high-quality questions)
gpqa_main
GPQA Main dataset
gsm8k
Grade School Math 8K dataset
math
MATH competition dataset
math500
MATH-500 subset
mmlu
Massive Multitask Language Understanding
mmlu_pro
MMLU-Pro (harder variant)
openbookqa
OpenBookQA question answering
openrs / openrs_easy / openrs_hard
OpenRS reasoning problems
prime_code
Prime verifiable coding problems

Preprocessing Functions

Internal preprocessing functions used by load_example_dataset():
def get_preprocess_fn(name: str) -> Callable[[dict], dict]
Returns a preprocessing function for the named dataset. Each preprocessor:
  • Extracts question and answer fields
  • Normalizes format (e.g., strips #### delimiters)
  • Handles dataset-specific quirks
These are internal functions. Use load_example_dataset() instead of calling preprocessors directly.
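For orientation, a preprocessor typically has this shape. The function below is hypothetical, written to mirror the bullet points above; it is not the actual function returned by get_preprocess_fn("gsm8k"):

```python
# Hypothetical GSM8K-style preprocessor: extract the question/answer
# fields and normalize the "reasoning #### answer" format.
def gsm8k_preprocess(example: dict) -> dict:
    answer = example["answer"]
    if "####" in answer:
        answer = answer.split("####", 1)[1].strip()  # strip the #### delimiter
    return {"question": example["question"], "answer": answer}

row = {"question": "What is 2+2?", "answer": "2+2 is 4.\n#### 4"}
print(gsm8k_preprocess(row))  # {'question': 'What is 2+2?', 'answer': '4'}
```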

Custom Dataset Example

import verifiers as vf
from verifiers.utils.data_utils import format_dataset
from datasets import Dataset

# Create custom dataset
raw_data = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 10*5?", "answer": "50"},
]

dataset = Dataset.from_list(raw_data)

# Format with system prompt
formatted = format_dataset(
    dataset,
    system_prompt="Solve the math problem.",
)

# Use in environment (reward functions are passed to Rubric as a list)
def exact_match(answer: str, completion, **kwargs) -> float:
    return 1.0 if answer in completion else 0.0

env = vf.SingleTurnEnv(
    dataset=formatted,
    rubric=vf.Rubric(funcs=[exact_match]),
)
