This example demonstrates the simplest type of Verifiers environment: single-turn question answering with symbolic verification. Models solve grade school math word problems using natural language reasoning.

Overview

The GSM8K environment provides:
  • Dataset: 8.5K grade school math word problems
  • Format: Single-turn response (no tools or code execution)
  • Evaluation: Symbolic math verification of final answers
  • Difficulty: Elementary to middle school level

Complete Implementation

Here’s the full working implementation from environments/gsm8k/gsm8k.py:
import verifiers as vf
from verifiers.utils.data_utils import (
    BOXED_SYSTEM_PROMPT,
    load_example_dataset,
)


def load_environment(
    system_prompt: str = BOXED_SYSTEM_PROMPT,
    num_train_examples=-1,
    num_eval_examples=-1,
):
    dataset = load_example_dataset("gsm8k", split="train")
    if num_train_examples != -1:
        dataset = dataset.select(range(num_train_examples))
    eval_dataset = load_example_dataset("gsm8k", split="test")
    if num_eval_examples != -1:
        eval_dataset = eval_dataset.select(range(num_eval_examples))

    rubric = vf.MathRubric()
    vf_env = vf.SingleTurnEnv(
        dataset=dataset,
        eval_dataset=eval_dataset,
        system_prompt=system_prompt,
        parser=rubric.parser,
        rubric=rubric,
    )
    return vf_env

How It Works

1. Dataset Loading

GSM8K contains grade school math word problems:
dataset = load_example_dataset("gsm8k", split="train")
eval_dataset = load_example_dataset("gsm8k", split="test")
Dataset size:
  • Training: 7,473 problems
  • Test: 1,319 problems
Dataset format:
{
    "question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
    "answer": "18"  # Extracted from "#### 18"
}

2. System Prompt

The default system prompt uses the standard BOXED_SYSTEM_PROMPT:
BOXED_SYSTEM_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."
This encourages:
  • Step-by-step reasoning (chain of thought)
  • Clear final answer formatting using \boxed{}
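In practice, the system prompt is paired with each question as a chat exchange. The sketch below shows the shape of the messages an example produces, assuming the standard OpenAI-style message format (the actual wiring is handled by SingleTurnEnv):

```python
# The standard boxed-answer system prompt from the docs above.
BOXED_SYSTEM_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."

# One GSM8K example becomes a two-message chat: system instructions
# followed by the raw question text.
messages = [
    {"role": "system", "content": BOXED_SYSTEM_PROMPT},
    {"role": "user", "content": "Janet's ducks lay 16 eggs per day. ..."},
]
```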

3. Evaluation with MathRubric

MathRubric handles answer extraction and verification:
rubric = vf.MathRubric()
Features:
  • Extracts answers from \boxed{} notation
  • Symbolic equivalence checking (handles different representations)
  • Numerical tolerance for floating-point answers
  • Returns 1.0 for correct, 0.0 for incorrect

4. Single-Turn Environment

SingleTurnEnv is the simplest environment type:
  • Model receives question
  • Model generates single response
  • Response is evaluated
  • No multi-turn interaction or tools
vf_env = vf.SingleTurnEnv(
    dataset=dataset,
    eval_dataset=eval_dataset,
    system_prompt=system_prompt,
    parser=rubric.parser,
    rubric=rubric,
)

Example Interactions

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?

Model Response: Let me work through this step-by-step:
  1. Janet’s ducks lay 16 eggs per day
  2. She eats 3 eggs for breakfast
  3. She uses 4 eggs for muffins
  4. Eggs remaining: 16 - 3 - 4 = 9 eggs
  5. She sells each egg for $2
  6. Total earnings: 9 × $2 = $18
Therefore, Janet makes \boxed{18} dollars every day at the farmers’ market.

Result: ✓ Correct (reward = 1.0)
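The arithmetic in this worked example can be checked directly:

```python
# Reproduce the worked example's calculation step by step.
eggs_laid = 16
eggs_eaten = 3        # breakfast
eggs_for_muffins = 4  # baking
price_per_egg = 2     # dollars

eggs_sold = eggs_laid - eggs_eaten - eggs_for_muffins  # 9 eggs remain
earnings = eggs_sold * price_per_egg                   # daily earnings
print(earnings)  # 18
```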

Running the Environment

Installation

# Install from environments directory
prime env install gsm8k

Quick Evaluation

# Evaluate on 20 test problems with 5 rollouts each
prime eval run gsm8k \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 20 \
  -r 5

Training Dataset

# Use training split for data generation
prime eval run gsm8k \
  -m openai/gpt-4.1-mini \
  --split train \
  -n 100 \
  -r 8

Custom Configuration

# Limit dataset size and customize prompt
prime eval run gsm8k \
  -m openai/gpt-4.1-mini \
  -a '{
    "num_train_examples": 100,
    "num_eval_examples": 50,
    "system_prompt": "Solve the math problem step-by-step. Show your work and put your final numerical answer in \\boxed{}."
  }' \
  -n 50 \
  -r 4

Configuration Options

Parameter            Default              Description
system_prompt        BOXED_SYSTEM_PROMPT  Instructions for the model
num_train_examples   -1                   Number of training examples (-1 = all)
num_eval_examples    -1                   Number of eval examples (-1 = all)
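The `-1` sentinel mirrors the selection logic in `load_environment` above. A plain-Python sketch of the same behavior (using lists in place of Hugging Face datasets):

```python
# Sketch of the -1 sentinel: -1 keeps the full split,
# any other value keeps only the first n examples.
def maybe_select(examples: list, n: int) -> list:
    return examples if n == -1 else examples[:n]

problems = ["p1", "p2", "p3", "p4"]
print(maybe_select(problems, -1))  # ['p1', 'p2', 'p3', 'p4']
print(maybe_select(problems, 2))   # ['p1', 'p2']
```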

Key Features

Separate Train/Eval Datasets

GSM8K demonstrates proper train/eval split:
dataset = load_example_dataset("gsm8k", split="train")        # 7,473 problems
eval_dataset = load_example_dataset("gsm8k", split="test")   # 1,319 problems
When running prime eval run, the eval dataset is used by default.

Answer Format Flexibility

MathRubric handles various answer formats:
# All equivalent:
"\\boxed{18}"      # Standard LaTeX
"\\boxed{18.0}"    # Floating point
"\\boxed{18.00}"   # Extra precision
"$\\boxed{18}$"    # With dollar signs

Symbolic Verification

MathRubric uses symbolic math verification:
# These are all considered correct for answer "0.5":
"\\boxed{0.5}"     # Decimal
"\\boxed{1/2}"     # Fraction
"\\boxed{0.50}"    # Extra zeros
"\\boxed{.5}"      # No leading zero
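The idea behind this equivalence can be sketched with the standard library's `fractions` module, which reduces each of the four representations above to the same rational value (the rubric's actual checker handles a much broader range of symbolic forms):

```python
from fractions import Fraction

# All four forms listed above parse to the rational 1/2.
forms = ["0.5", "1/2", "0.50", ".5"]
print(all(Fraction(f) == Fraction(1, 2) for f in forms))  # True
```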

Metrics Tracked

  • correct_answer: 1.0 if answer matches ground truth, 0.0 otherwise
  • reward: Same as correct_answer for this simple environment

Advanced Usage

Custom System Prompts

Experiment with different prompting strategies:
# Chain of thought with explicit steps
COT_PROMPT = """Solve this math problem step-by-step.

1. Identify what is being asked
2. List the given information
3. Perform the calculations
4. State your final answer in \\boxed{}
"""

env = load_environment(system_prompt=COT_PROMPT)

Filtering Dataset

Create custom dataset subsets:
import verifiers as vf
from verifiers.utils.data_utils import load_example_dataset

def load_environment():
    dataset = load_example_dataset("gsm8k", split="test")
    
    # Filter for problems containing specific keywords
    def is_about_money(example):
        question = example["question"].lower()
        return "dollar" in question or "$" in question
    
    dataset = dataset.filter(is_about_money)
    
    rubric = vf.MathRubric()
    return vf.SingleTurnEnv(
        dataset=dataset,
        system_prompt="Solve this money-related math problem.",
        rubric=rubric,
    )

Adding Metrics

Track additional metrics beyond correctness:
import verifiers as vf
from verifiers.utils.data_utils import load_example_dataset
def load_environment():
    dataset = load_example_dataset("gsm8k", split="test")
    rubric = vf.MathRubric()
    
    # Track response length
    async def response_length(completion) -> float:
        return float(len(completion[-1]["content"]))
    
    # Track number of steps
    async def num_steps(completion) -> float:
        content = completion[-1]["content"]
        # Count numbered steps like "1.", "2.", etc.
        import re
        steps = re.findall(r'^\d+\.', content, re.MULTILINE)
        return float(len(steps))
    
    rubric.add_metric(response_length)
    rubric.add_metric(num_steps)
    
    return vf.SingleTurnEnv(
        dataset=dataset,
        rubric=rubric,
    )

Comparison with Math Python

Feature        GSM8K                      Math Python
Interaction    Single-turn                Multi-turn
Tools          None                       Python REPL
Sandbox        No                         Yes (containerized)
Difficulty     Elementary/middle school   Competition math
Reasoning      Natural language           Code + reasoning
Startup time   Instant                    ~10-30s (sandbox)
Use case       Quick evaluation           Complex computation
