Overview
The GSM8K environment provides:

- Dataset: 8.5K grade school math word problems
- Format: Single-turn response (no tools or code execution)
- Evaluation: Symbolic math verification of final answers
- Difficulty: Elementary to middle school level
Complete Implementation
Here’s the full working implementation from `environments/gsm8k/gsm8k.py`:
How It Works
1. Dataset Loading
GSM8K contains grade school math word problems:

- Training: 7,473 problems
- Test: 1,319 problems
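Each GSM8K record pairs a question with a worked solution whose last line carries the ground-truth value after a `####` marker. A minimal sketch of pulling that value out with a plain regex (illustrative only, not the loader's actual code):

```python
import re

def extract_gsm8k_answer(raw_answer: str) -> str:
    """Pull the final numeric value after the '#### ' marker in a GSM8K answer."""
    match = re.search(r"####\s*([\-0-9.,]+)", raw_answer)
    if match is None:
        raise ValueError("no '####' marker found")
    return match.group(1).replace(",", "")

sample = "She sells 16 - 3 - 4 = 9 eggs at $2 each, making 9 * 2 = $18.\n#### 18"
print(extract_gsm8k_answer(sample))  # -> 18
```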
2. System Prompt
The default system prompt uses the standard `BOXED_SYSTEM_PROMPT`, which encourages:

- Step-by-step reasoning (chain of thought)
- Clear final answer formatting using `\boxed{}`
3. Evaluation with MathRubric
`MathRubric` handles answer extraction and verification:

- Extracts answers from `\boxed{}` notation
- Symbolic equivalence checking (handles different representations)
- Numerical tolerance for floating-point answers
- Returns 1.0 for correct, 0.0 for incorrect
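`MathRubric`'s actual checking is symbolic; a much simpler numeric-equivalence sketch using only the standard library, just to show the idea of format-tolerant comparison:

```python
from fractions import Fraction

def answers_match(predicted: str, target: str, tol: float = 1e-6) -> bool:
    """Compare answer strings numerically, tolerating $, commas, and fraction forms."""
    def to_number(s: str) -> Fraction:
        return Fraction(s.strip().replace(",", "").replace("$", ""))
    try:
        return abs(float(to_number(predicted) - to_number(target))) <= tol
    except (ValueError, ZeroDivisionError):
        # Fall back to exact string comparison for non-numeric answers
        return predicted.strip() == target.strip()

print(answers_match("18", "$18.00"))  # -> True
print(answers_match("9/2", "4.5"))    # -> True
print(answers_match("17", "18"))      # -> False
```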
4. Single-Turn Environment
`SingleTurnEnv` is the simplest environment type:

- Model receives question
- Model generates single response
- Response is evaluated
- No multi-turn interaction or tools
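The flow above reduces to: question in, one response out, one score back. A stub sketch under hypothetical names (this is not the actual `SingleTurnEnv` API):

```python
from typing import Callable

def run_single_turn(question: str,
                    model: Callable[[str], str],
                    score: Callable[[str], float]) -> float:
    """One rollout: send the question, take a single response, score it.
    No tools, no follow-up turns."""
    response = model(question)  # single completion
    return score(response)

# Stub model and scorer for illustration
stub_model = lambda q: "16 - 3 - 4 = 9 eggs; 9 * 2 = \\boxed{18}"
stub_score = lambda r: 1.0 if "\\boxed{18}" in r else 0.0

print(run_single_turn("How much does Janet make daily?", stub_model, stub_score))  # -> 1.0
```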
Example Interactions
- Example 1: Ducks
- Example 2: Store
- Example 3: Incorrect
Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?

Model Response:
Let me work through this step-by-step:
- Janet’s ducks lay 16 eggs per day
- She eats 3 eggs for breakfast
- She uses 4 eggs for muffins
- Eggs remaining: 16 - 3 - 4 = 9 eggs
- She sells each egg for $2
- Total earnings: 9 × $2 = $18
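The response's arithmetic can be verified directly:

```python
# Janet's egg math from the example above
eggs_laid, eaten, baked, price = 16, 3, 4, 2
remaining = eggs_laid - eaten - baked
print(remaining)          # -> 9
print(remaining * price)  # -> 18
```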
Running the Environment
Installation
Quick Evaluation
Training Dataset
Custom Configuration
Configuration Options
| Parameter | Default | Description |
|---|---|---|
| `system_prompt` | `BOXED_SYSTEM_PROMPT` | Instructions for the model |
| `num_train_examples` | -1 | Number of training examples (-1 = all) |
| `num_eval_examples` | -1 | Number of eval examples (-1 = all) |
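The `-1 = all` convention can be sketched with a hypothetical slicing helper (the name and signature are illustrative, not the environment's actual code):

```python
def take_examples(dataset: list, n: int) -> list:
    """Return the first n examples; n == -1 means keep the whole dataset."""
    return list(dataset) if n == -1 else list(dataset)[:n]

train = [f"problem {i}" for i in range(7473)]
print(len(take_examples(train, -1)))   # -> 7473
print(len(take_examples(train, 100)))  # -> 100
```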
Key Features
Separate Train/Eval Datasets
GSM8K demonstrates a proper train/eval split. When running `prime eval run`, the eval dataset is used by default.
Answer Format Flexibility
`MathRubric` handles various answer formats:
Symbolic Verification
`MathRubric` uses symbolic math verification:

Metrics Tracked

- `correct_answer`: 1.0 if the answer matches ground truth, 0.0 otherwise
- `reward`: Same as `correct_answer` for this simple environment
Advanced Usage
Custom System Prompts
Experiment with different prompting strategies:

Filtering Dataset

Create custom dataset subsets:

Adding Metrics

Track additional metrics beyond correctness:

Comparison with Math Python
| Feature | GSM8K | Math Python |
|---|---|---|
| Interaction | Single-turn | Multi-turn |
| Tools | None | Python REPL |
| Sandbox | No | Yes (containerized) |
| Difficulty | Elementary/Middle school | Competition math |
| Reasoning | Natural language | Code + reasoning |
| Startup time | Instant | ~10-30s (sandbox) |
| Use case | Quick evaluation | Complex computation |
Related Examples
- Math Python - Math problems with code execution
- Wordle - Multi-turn game environment
- Wiki Search - Tool-based question answering
Next Steps
- Learn about SingleTurnEnv architecture
- Explore MathRubric for symbolic verification
- See Datasets for loading and formatting data