This example demonstrates how to create an environment where models can solve math problems by writing and executing Python code. The environment provides a sandboxed Python REPL with scientific computing libraries.

Overview

The Math Python environment combines:
  • Dataset: MATH competition problems (or custom math datasets)
  • Tools: Python REPL with numpy, sympy, scipy
  • Evaluation: Symbolic math verification using \boxed{} answer format
  • Sandbox: Isolated execution environment with configurable resources

Complete Implementation

Here’s the full working implementation from environments/math_python/math_python.py:
import verifiers as vf
from verifiers.utils.data_utils import extract_boxed_answer, load_example_dataset


def load_environment(
    dataset_name: str = "math",
    dataset_split: str = "train",
    num_train_examples: int = -1,
    max_turns: int = 100,
    max_startup_wait_seconds: int = 60,
    pip_install_packages: str = "numpy sympy scipy",
    sandbox_cpu_cores: int = 1,
    sandbox_memory_gb: int = 2,
    sandbox_disk_size_gb: int = 5,
    sandbox_gpu_count: int = 0,
    sandbox_timeout_minutes: int = 60,
    sandbox_timeout_per_command_seconds: int = 60,
    sandbox_client_max_workers: int = 50,
    **kwargs,
):
    dataset = load_example_dataset(dataset_name, dataset_split, n=num_train_examples)
    pip_install_prompt = (
        f"In addition to the Python standard library, you have access to: {pip_install_packages}."
        if pip_install_packages.strip()
        else "You may only use the Python standard library."
    )
    system_prompt = (
        "Use Python for all calculations. Give your answer inside \\boxed{}."
    )
    system_prompt += "\n\n" + pip_install_prompt

    parser = vf.Parser(extract_fn=extract_boxed_answer)
    math_rubric = vf.MathRubric(parser=parser)
    return vf.PythonEnv(
        dataset=dataset,
        system_prompt=system_prompt,
        parser=parser,
        rubric=math_rubric,
        max_turns=max_turns,
        # python env args
        max_startup_wait_seconds=max_startup_wait_seconds,
        pip_install_packages=pip_install_packages,
        # sandbox env args
        cpu_cores=sandbox_cpu_cores,
        memory_gb=sandbox_memory_gb,
        disk_size_gb=sandbox_disk_size_gb,
        gpu_count=sandbox_gpu_count,
        timeout_minutes=sandbox_timeout_minutes,
        timeout_per_command_seconds=sandbox_timeout_per_command_seconds,
        sandbox_client_max_workers=sandbox_client_max_workers,
        **kwargs,
    )

How It Works

1. Dataset Loading

The environment uses the load_example_dataset utility to load math problems:
dataset = load_example_dataset("math", "train", n=num_train_examples)
Supported datasets:
  • "math" - MATH competition problems (training: 7,500 problems)
  • "math500" - MATH-500 benchmark (500 test problems)
  • "aime2024", "aime2025" - AIME competition problems
  • "gsm8k" - Grade school math (see GSM8K example)
Dataset format:
{
    "question": "What is the value of $\\sqrt{3^2 + 4^2}$?",
    "answer": "5"
}

2. System Prompt

The system prompt instructs the model to:
  • Use Python for calculations
  • Format final answers using \boxed{} notation
It also lists the available packages (numpy, sympy, scipy by default):
system_prompt = (
    "Use Python for all calculations. Give your answer inside \\boxed{}.\n\n"
    "In addition to the Python standard library, you have access to: numpy sympy scipy."
)

3. Answer Parsing

The extract_boxed_answer function extracts content from LaTeX \boxed{} notation:
parser = vf.Parser(extract_fn=extract_boxed_answer)

# Example: "The answer is \\boxed{42}" → "42"

4. Math Verification

MathRubric provides symbolic math verification:
math_rubric = vf.MathRubric(parser=parser)
Features:
  • Symbolic equivalence checking (e.g., “1/2” equals “0.5”)
  • LaTeX expression normalization
  • Floating-point tolerance for numerical answers
  • Returns 1.0 for correct answers, 0.0 otherwise

5. Python Sandbox Environment

PythonEnv provides:
  • Isolated execution environment (Docker container)
  • Persistent Python REPL session
  • Pre-installed packages (numpy, sympy, scipy)
  • Configurable resources (CPU, memory, disk)
  • Automatic cleanup after rollouts

Example Interaction

User: What is the value of $\sqrt{3^2 + 4^2}$?

Assistant: I’ll use Python to calculate this.

import math
result = math.sqrt(3**2 + 4**2)
print(result)

Tool Output: 5.0

Assistant: The value is \boxed{5}

Result: ✓ Correct (reward = 1.0)

Running the Environment

Installation

# Install from environments directory
prime env install math-python

Quick Evaluation

# Evaluate with 10 problems
prime eval run math-python \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 10 \
  -r 1

Custom Configuration

# Use MATH-500 benchmark with more resources
prime eval run math-python \
  -m openai/gpt-4.1-mini \
  -a '{
    "dataset_name": "math500",
    "dataset_split": "test",
    "sandbox_cpu_cores": 2,
    "sandbox_memory_gb": 4,
    "pip_install_packages": "numpy sympy scipy matplotlib"
  }' \
  -n 50 \
  -r 4

Configuration Options

Parameter                   Default                 Description
dataset_name                "math"                  Dataset to use (math, math500, aime2024, etc.)
dataset_split               "train"                 Dataset split (train, test)
num_train_examples          -1                      Number of examples (-1 = all)
max_turns                   100                     Maximum interaction turns
pip_install_packages        "numpy sympy scipy"     Space-separated package list
sandbox_cpu_cores           1                       CPU cores for the sandbox
sandbox_memory_gb           2                       Sandbox memory in GB
sandbox_disk_size_gb        5                       Sandbox disk size in GB
sandbox_timeout_minutes     60                      Sandbox lifetime timeout (minutes)

Key Features

Sandboxed Execution

  • Isolation: Each rollout gets a fresh sandbox container
  • Security: No access to host filesystem or network (by default)
  • Resource limits: Configurable CPU, memory, and disk quotas
  • Automatic cleanup: Containers are destroyed after rollouts

Package Management

Customize available packages:
env = load_environment(
    pip_install_packages="numpy sympy scipy matplotlib pandas"
)
Or restrict to standard library only:
env = load_environment(
    pip_install_packages=""  # Empty string = standard library only
)

Multi-Turn Interaction

The environment supports iterative problem-solving:
  1. Model writes Python code
  2. Code executes in sandbox
  3. Model sees output and continues reasoning
  4. Repeats until model provides final answer or hits max_turns
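The loop above can be sketched in plain Python. All names here are illustrative, not the actual PythonEnv API:

```python
# Hypothetical sketch of the multi-turn loop; real PythonEnv internals differ.
def run_episode(model, sandbox, prompt, max_turns=100):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = model(messages)  # Model writes code or a final answer
        messages.append({"role": "assistant", "content": reply})
        if "\\boxed{" in reply:  # Final answer detected: stop
            break
        output = sandbox(reply)  # Execute code, capture stdout
        messages.append({"role": "tool", "content": output})
    return messages
```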

Metrics Tracked

  • correct_answer: 1.0 if answer matches ground truth, 0.0 otherwise
  • num_turns: Number of model-environment interactions
  • sandbox_ready_wait_time: Time to initialize sandbox (seconds)
  • sandbox_command_execution_time: Total time executing Python code
  • python_ready_wait_time: Time to start Python REPL

Advanced Usage

Custom Answer Extraction

Provide your own answer extraction logic:
def custom_extract_answer(text: str) -> str:
    """Extract answer from custom format."""
    if "ANSWER:" in text:
        return text.split("ANSWER:")[1].strip()
    return text

parser = vf.Parser(extract_fn=custom_extract_answer)
rubric = vf.MathRubric(parser=parser)
env = vf.PythonEnv(
    dataset=dataset,
    parser=parser,
    rubric=rubric,
    system_prompt="Solve the problem and format your answer as ANSWER: <value>"
)

Custom Reward Functions

Add additional reward signals:
async def efficiency_bonus(state, answer) -> float:
    """Reward correct solutions found in few turns."""
    num_turns = state.get("turn", 0)
    completion = state.get("completion", [])
    final_text = completion[-1].get("content", "") if completion else ""
    if answer in final_text and num_turns < 5:
        return 0.2  # Bonus for solving quickly
    return 0.0

math_rubric.add_reward_func(efficiency_bonus, weight=1.0)

Related Examples

  • GSM8K - Single-turn math reasoning without code execution
  • Wiki Search - Tool environment with custom tools
  • Browser Examples - More complex stateful environments

Next Steps

  • Learn about Environments to understand the architecture
  • See Sandboxes for more on containerized execution
  • Explore Rubrics for custom evaluation logic
