Overview
The Math Python environment combines:
- Dataset: MATH competition problems (or custom math datasets)
- Tools: Python REPL with numpy, sympy, scipy
- Evaluation: Symbolic math verification using the \boxed{} answer format
- Sandbox: Isolated execution environment with configurable resources
Complete Implementation
Here’s the full working implementation from environments/math_python/math_python.py:
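At a glance, the file follows roughly this shape (a sketch only — the entry-point name follows the usual load_environment convention, and the helpers named in comments are described in the sections below; the real file is the authority):

```python
# Rough structural sketch, not the actual implementation.
def load_environment(dataset_name="math", dataset_split="train",
                     num_train_examples=-1, max_turns=100, **sandbox_kwargs):
    dataset = ...   # load_example_dataset(dataset_name, split=dataset_split)
    parser = ...    # boxed-answer parsing via extract_boxed_answer
    rubric = ...    # MathRubric for symbolic verification
    return ...      # PythonEnv wiring dataset, rubric, and sandbox options
```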
How It Works
1. Dataset Loading
The environment uses the load_example_dataset utility to load math problems:
"math"- MATH competition problems (training: 7,500 problems)"math500"- MATH-500 benchmark (500 test problems)"aime2024","aime2025"- AIME competition problems"gsm8k"- Grade school math (see GSM8K example)
2. System Prompt
The system prompt:
- Instructs the model to use Python for calculations
- Requires final answers in \boxed{} notation
- Lists the available packages (numpy, sympy, scipy by default)
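As an illustrative sketch (the exact wording lives in the environment implementation and is not shown here), such a prompt might look like:

```python
# Illustrative only -- this just mirrors the three points above;
# the real system prompt text is defined in the environment.
SYSTEM_PROMPT = (
    "You are a careful math problem solver. Use Python to perform "
    "calculations. Available packages: numpy, sympy, scipy. "
    "Put your final answer in \\boxed{} notation."
)
print("\\boxed{}" in SYSTEM_PROMPT)  # -> True
```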
3. Answer Parsing
The extract_boxed_answer function extracts the content of the LaTeX \boxed{} notation:
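As a minimal sketch of what such an extractor does (the real utility may handle more edge cases), the key detail is that the contents must be matched with balanced braces, not a naive regex:

```python
def extract_boxed_answer(text):
    # Sketch: take the LAST \boxed{...} in the text and return its
    # brace-balanced contents, so nested LaTeX like \frac{1}{2} survives.
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(text[i])
        i += 1
    return None  # unbalanced braces

print(extract_boxed_answer(r"So the answer is \boxed{\frac{1}{2}}."))  # -> \frac{1}{2}
```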
4. Math Verification
MathRubric provides symbolic math verification:
- Symbolic equivalence checking (e.g., “1/2” equals “0.5”)
- LaTeX expression normalization
- Floating-point tolerance for numerical answers
- Returns 1.0 for correct answers, 0.0 otherwise
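The checks above can be approximated with the standard library alone; the following is a simplified stand-in (MathRubric itself does richer, LaTeX-aware symbolic comparison):

```python
from fractions import Fraction

def math_reward(parsed, truth, tol=1e-9):
    # Stand-in for symbolic verification: compare as exact rationals
    # (so "1/2" == "0.5"), fall back to a float tolerance, then to a
    # plain string match. Returns 1.0 for correct, 0.0 otherwise.
    try:
        return 1.0 if Fraction(parsed) == Fraction(truth) else 0.0
    except ValueError:
        pass
    try:
        return 1.0 if abs(float(parsed) - float(truth)) <= tol else 0.0
    except ValueError:
        return 1.0 if parsed.strip() == truth.strip() else 0.0

print(math_reward("1/2", "0.5"))  # -> 1.0
print(math_reward("3", "4"))      # -> 0.0
```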
5. Python Sandbox Environment
PythonEnv provides:
- Isolated execution environment (Docker container)
- Persistent Python REPL session
- Pre-installed packages (numpy, sympy, scipy)
- Configurable resources (CPU, memory, disk)
- Automatic cleanup after rollouts
Example Interaction
The example below shows the model interaction; the source also includes the corresponding dataset sample and expected solution.

User: What is the value of [expression omitted]?
Assistant: I'll use Python to calculate this.
Tool Output: 5.0
Assistant: The value is [answer omitted].
Result: ✓ Correct (reward = 1.0)

Running the Environment
Installation
Quick Evaluation
Custom Configuration
Configuration Options
| Parameter | Default | Description |
|---|---|---|
| dataset_name | "math" | Dataset to use (math, math500, aime2024, etc.) |
| dataset_split | "train" | Dataset split (train, test) |
| num_train_examples | -1 | Number of examples (-1 = all) |
| max_turns | 100 | Maximum interaction turns |
| pip_install_packages | "numpy sympy scipy" | Space-separated package list |
| sandbox_cpu_cores | 1 | CPU cores for sandbox |
| sandbox_memory_gb | 2 | Memory in GB |
| sandbox_disk_size_gb | 5 | Disk size in GB |
| sandbox_timeout_minutes | 60 | Sandbox lifetime timeout |
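When constructing the environment, these options are typically passed as keyword arguments; the sketch below collects the table's defaults into a mapping, with dataset_name overridden as an example (the exact call signature is not shown here):

```python
# Values from the configuration table; dataset_name is overridden to
# evaluate on MATH-500 as an example. Passing a subset of these as
# keyword arguments is the usual pattern.
env_args = {
    "dataset_name": "math500",
    "dataset_split": "test",
    "num_train_examples": -1,   # -1 = use every example
    "max_turns": 100,
    "pip_install_packages": "numpy sympy scipy",
    "sandbox_cpu_cores": 1,
    "sandbox_memory_gb": 2,
    "sandbox_disk_size_gb": 5,
    "sandbox_timeout_minutes": 60,
}
print(env_args["pip_install_packages"].split())  # -> ['numpy', 'sympy', 'scipy']
```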
Key Features
Sandboxed Execution
- Isolation: Each rollout gets a fresh sandbox container
- Security: No access to host filesystem or network (by default)
- Resource limits: Configurable CPU, memory, and disk quotas
- Automatic cleanup: Containers are destroyed after rollouts
Package Management
Customize the available packages via the pip_install_packages option.

Multi-Turn Interaction
The environment supports iterative problem-solving:
- Model writes Python code
- Code executes in the sandbox
- Model sees the output and continues reasoning
- This repeats until the model provides a final answer or hits max_turns
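The loop above can be sketched with stubbed-out model and sandbox (all names here are illustrative, not the environment's API):

```python
# Sketch of the multi-turn loop; the real environment wires these
# callables to an LLM and a Docker-hosted Python REPL.
def run_episode(model, sandbox, max_turns=100):
    transcript = []
    for _ in range(max_turns):
        message = model(transcript)          # model writes code or an answer
        transcript.append(message)
        if r"\boxed{" in message:            # final answer -> stop
            return transcript
        transcript.append(sandbox(message))  # otherwise execute and feed back
    return transcript

# Toy stubs: answer on the second turn, after seeing one tool output.
def toy_model(transcript):
    return "print(2 + 3)" if not transcript else r"The value is \boxed{5}."

toy_sandbox = lambda code: "5"
print(len(run_episode(toy_model, toy_sandbox)))  # code, output, answer -> 3
```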
Metrics Tracked
- correct_answer: 1.0 if the answer matches ground truth, 0.0 otherwise
- num_turns: Number of model-environment interactions
- sandbox_ready_wait_time: Time to initialize the sandbox (seconds)
- sandbox_command_execution_time: Total time spent executing Python code
- python_ready_wait_time: Time to start the Python REPL
Advanced Usage
Custom Answer Extraction
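As an illustrative sketch (the function name and regex are hypothetical), an extractor could grab the last number in a completion instead of requiring \boxed{}:

```python
import re

def extract_last_number(text):
    # Hypothetical custom extractor: return the last integer or decimal
    # in the completion, or None if no number is present.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None

print(extract_last_number("The tax is 7.5, so the total is 107.5"))  # -> 107.5
```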
You can substitute any extraction logic that matches your dataset's answer format.

Custom Reward Functions
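A sketch of one auxiliary signal (names are illustrative, not the library's API): a small bonus for finishing in fewer turns, layered on top of the 0/1 correctness reward:

```python
# Illustrative auxiliary reward: reward concise rollouts slightly,
# weighted so correctness still dominates the total.
def efficiency_bonus(num_turns, max_turns=100):
    return max(0.0, 1.0 - num_turns / max_turns)

def total_reward(correct, num_turns):
    return correct + 0.1 * efficiency_bonus(num_turns)

print(total_reward(1.0, 4))  # correctness plus a small turn-count bonus
```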
Additional reward signals can be combined with the correctness score.

Related Examples
- GSM8K - Single-turn math reasoning without code execution
- Wiki Search - Tool environment with custom tools
- Browser Examples - More complex stateful environments
Next Steps
- Learn about Environments to understand the architecture
- See Sandboxes for more on containerized execution
- Explore Rubrics for custom evaluation logic