The ReasoningGymEnv integration wraps Reasoning Gym procedural datasets for use in Verifiers environments.
Reasoning Gym provides a collection of procedurally generated reasoning tasks designed to test various cognitive abilities of language models.
## Features
- Procedural generation - Infinite task variations via seeds
- Multiple datasets - Access to all Reasoning Gym tasks
- Composite datasets - Mix multiple tasks with custom weights
- Automatic scoring - Uses task-specific scoring from Reasoning Gym
- XML formatting - Built-in parser for structured outputs
## Installation
Install with Reasoning Gym support:
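Assuming the dependency comes straight from PyPI (your setup may use an extras syntax instead):

```bash
# Direct install of the Reasoning Gym dependency; adapt to your package manager.
pip install reasoning-gym
```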
This installs the reasoning-gym package.
## Quick Start

### Create environment

Create a basic Reasoning Gym environment:

```python
import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv


def load_environment():
    return ReasoningGymEnv(
        gym="arc_1d",
        num_train_examples=1000,
        num_eval_examples=100,
        seed=0,
    )
```
### Evaluate

Run an evaluation:

```bash
prime eval run my-rg-env -m openai/gpt-4.1-mini -n 20
```
## Available Datasets

Reasoning Gym provides tasks across multiple categories:

### Pattern Recognition

- `arc_1d` - ARC-like 1D pattern completion
- `pattern_induction` - Identify and continue patterns

### Mathematics

- `elementary_algebra` - Solve algebraic equations
- `number_theory` - Number properties and relationships
- `arithmetic` - Basic arithmetic operations

### Logic

- `propositional_logic` - Logical reasoning with propositions
- `spatial_reasoning` - Reason about spatial relationships

### Other

- `zebra_puzzle` - Logic grid puzzles
- `word_problems` - Text-based reasoning

See the Reasoning Gym repository for the full list.
## Configuration

### Single Dataset

```python
env = ReasoningGymEnv(
    gym="arc_1d",
    num_train_examples=2000,
    num_eval_examples=500,
    seed=0,
)
```
### Multiple Datasets

Combine multiple datasets with equal weights:

```python
env = ReasoningGymEnv(
    gym=["arc_1d", "elementary_algebra", "propositional_logic"],
    num_train_examples=3000,  # 1000 per dataset
    num_eval_examples=300,    # 100 per dataset
    seed=0,
)
```
### Weighted Composite

Mix datasets with custom weights:

```python
env = ReasoningGymEnv(
    gym=[
        {"name": "arc_1d", "weight": 0.5, "config": {}},
        {"name": "elementary_algebra", "weight": 0.3, "config": {}},
        {"name": "propositional_logic", "weight": 0.2, "config": {}},
    ],
    num_train_examples=1000,
    seed=0,
)
```
### Custom Parser

By default, ReasoningGymEnv uses `XMLParser` with `<think>` and `<answer>` fields. Override with a custom parser:

```python
custom_parser = vf.XMLParser(
    fields=["reasoning", "solution"],
    answer_field="solution",
)

env = ReasoningGymEnv(
    gym="arc_1d",
    parser=custom_parser,
    num_train_examples=1000,
)
```
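Assuming field names map directly to XML tags (as with the default `<think>`/`<answer>` pair), completions would then be expected to look like:

```text
<reasoning>
The sequence doubles at each step.
</reasoning>
<solution>
8
</solution>
```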
### Custom System Prompt

```python
from reasoning_gym.utils import SYSTEM_PROMPTS

env = ReasoningGymEnv(
    gym="arc_1d",
    system_prompt=SYSTEM_PROMPTS["chain_of_thought"],
    num_train_examples=1000,
)
```
Available system prompts:

- `"default"` - Basic reasoning prompt
- `"chain_of_thought"` - Encourages step-by-step reasoning
- `"concise"` - Encourages brief responses
## Scoring

Reasoning Gym tasks have built-in scoring functions. ReasoningGymEnv automatically:

- Parses the model's answer field
- Calls the task-specific `score_answer()` function
- Returns a score (typically 0.0 or 1.0)

Format reward (XML compliance) is tracked separately with weight 0.
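The flow is roughly equivalent to the following sketch (illustrative only, not the actual implementation; it assumes the default `XMLParser` and the `create_dataset`/`score_answer` APIs shown in the comparison section below):

```python
import reasoning_gym as rg
import verifiers as vf

# Illustrative sketch of the scoring flow ReasoningGymEnv performs internally.
dataset = rg.create_dataset("arc_1d", size=10, seed=0)
parser = vf.XMLParser(fields=["think", "answer"], answer_field="answer")

entry = dataset[0]
completion = "<think>Powers of 2.</think>\n<answer>8</answer>"

answer = parser.parse_answer(completion)       # extract the <answer> field
score = dataset.score_answer(answer, entry)    # task-specific scoring
print(score)                                   # typically 0.0 or 1.0
```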
## Full Example

```python
import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv


def load_environment(
    gym: str | list = "arc_1d",
    num_train_examples: int = 2000,
    num_eval_examples: int = 500,
    seed: int = 0,
) -> vf.Environment:
    """Load a Reasoning Gym environment.

    Args:
        gym: Dataset name, list of names, or composite spec
        num_train_examples: Number of training examples
        num_eval_examples: Number of eval examples
        seed: Random seed for generation
    """
    return ReasoningGymEnv(
        gym=gym,
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        seed=seed,
    )
```
With composite datasets:

```python
def load_environment():
    return ReasoningGymEnv(
        gym=[
            {"name": "arc_1d", "weight": 0.4},
            {"name": "elementary_algebra", "weight": 0.3},
            {"name": "propositional_logic", "weight": 0.2},
            {"name": "zebra_puzzle", "weight": 0.1},
        ],
        num_train_examples=2000,
        num_eval_examples=400,
        seed=0,
    )
```
Models should respond with XML-formatted answers:

```text
<think>
Let me analyze the pattern:
- First element: 1
- Second element: 2
- Third element: 4
This appears to be powers of 2.
</think>
<answer>
8
</answer>
```
The `<answer>` field is extracted and passed to the task scorer.
## Metrics

| Metric | Meaning |
|---|---|
| `reward` | Task-specific score (0.0 or 1.0) |
| `format_reward` | XML format compliance (tracked with weight 0) |
## Best Practices
Start with a single dataset to understand task difficulty before mixing multiple datasets.
- Validate baseline - Test with a strong model first to ensure tasks are solvable
- Match difficulty - Mix tasks of similar difficulty for stable training
- Use composite carefully - Large differences in task difficulty can hurt training
- Set appropriate seeds - Different seeds generate different task variations
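For example, a quick way to confirm that different seeds produce different task instances, using `reasoning_gym` directly (a minimal sketch; entries are assumed to expose a `"question"` key, as in the comparison example below):

```python
import reasoning_gym as rg

# Same task, different seeds -> different procedurally generated instances
a = rg.create_dataset("arc_1d", size=1, seed=0)
b = rg.create_dataset("arc_1d", size=1, seed=1)
print(a[0]["question"] == b[0]["question"])  # typically False
```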
## Comparison with Raw Reasoning Gym

Using Reasoning Gym directly:

```python
import reasoning_gym as rg

dataset = rg.create_dataset("arc_1d", size=100, seed=0)

for entry in dataset:
    question = entry["question"]
    # ... run model, parse answer
    score = dataset.score_answer(answer, entry)
```

Using ReasoningGymEnv:

```python
env = ReasoningGymEnv(gym="arc_1d", num_train_examples=100, seed=0)
# Handles dataset creation, parsing, and scoring automatically
```
## Examples
See the reasoning-gym-env example in the Verifiers repository.
## Further Reading

- [Reasoning Gym repository](https://github.com/open-thought/reasoning-gym) - full list of datasets and task details