The ReasoningGymEnv integration wraps Reasoning Gym procedural datasets for use in Verifiers environments. Reasoning Gym provides a collection of procedurally generated reasoning tasks designed to test various cognitive abilities of language models.

Features

  • Procedural generation - Infinite task variations via seeds
  • Multiple datasets - Access to all Reasoning Gym tasks
  • Composite datasets - Mix multiple tasks with custom weights
  • Automatic scoring - Uses task-specific scoring from Reasoning Gym
  • XML formatting - Built-in parser for structured outputs

Installation

Install with Reasoning Gym support:
uv add 'verifiers[rg]'
This installs the reasoning-gym package.
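As a quick sanity check (not part of the official setup), confirm the package imports cleanly:
python -c "import reasoning_gym"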

Quick Start

1. Create environment

Create a basic Reasoning Gym environment:
import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

def load_environment():
    return ReasoningGymEnv(
        gym="arc_1d",
        num_train_examples=1000,
        num_eval_examples=100,
        seed=0,
    )
2. Evaluate

Run an evaluation:
prime eval run my-rg-env -m openai/gpt-4.1-mini -n 20
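You can also evaluate from Python. The snippet below is a sketch that assumes the standard Verifiers Environment.evaluate API and an OpenAI-compatible client, rather than the canonical invocation:
from openai import OpenAI

env = load_environment()
client = OpenAI()  # any OpenAI-compatible endpoint
results = env.evaluate(client=client, model="gpt-4.1-mini", num_examples=20)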

Available Datasets

Reasoning Gym provides tasks across multiple categories:

Pattern Recognition

  • arc_1d - ARC-like 1D pattern completion
  • pattern_induction - Identify and continue patterns

Mathematics

  • elementary_algebra - Solve algebraic equations
  • number_theory - Number properties and relationships
  • arithmetic - Basic arithmetic operations

Logic

  • propositional_logic - Logical reasoning with propositions
  • spatial_reasoning - Reason about spatial relationships

Other

  • zebra_puzzle - Logic grid puzzles
  • word_problems - Text-based reasoning
See the Reasoning Gym repository for the full list.
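Before committing to a dataset, it can help to preview a few generated entries to gauge difficulty and phrasing; this uses the same create_dataset API shown in the comparison section below:
import reasoning_gym as rg

dataset = rg.create_dataset("arc_1d", size=3, seed=0)
for entry in dataset:
    print(entry["question"])  # the task prompt
    print(entry["answer"])    # the reference answer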

Configuration

Single Dataset

env = ReasoningGymEnv(
    gym="arc_1d",
    num_train_examples=2000,
    num_eval_examples=500,
    seed=0,
)

Multiple Datasets

Combine multiple datasets with equal weights:
env = ReasoningGymEnv(
    gym=["arc_1d", "elementary_algebra", "propositional_logic"],
    num_train_examples=3000,  # 1000 per dataset
    num_eval_examples=300,    # 100 per dataset
    seed=0,
)

Weighted Composite

Mix datasets with custom weights:
env = ReasoningGymEnv(
    gym=[
        {"name": "arc_1d", "weight": 0.5, "config": {}},
        {"name": "elementary_algebra", "weight": 0.3, "config": {}},
        {"name": "propositional_logic", "weight": 0.2, "config": {}},
    ],
    num_train_examples=1000,
    seed=0,
)
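Each entry's config dict is presumably forwarded to that dataset's generation parameters. The key below is hypothetical and only illustrates the mechanism; check each dataset's config class in Reasoning Gym for the options it actually accepts:
env = ReasoningGymEnv(
    gym=[
        # "max_size" is a hypothetical config key, for illustration only
        {"name": "arc_1d", "weight": 1.0, "config": {"max_size": 20}},
    ],
    num_train_examples=1000,
    seed=0,
)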

Custom Parser

By default, ReasoningGymEnv uses XMLParser with <think> and <answer> fields. Override with a custom parser:
custom_parser = vf.XMLParser(
    fields=["reasoning", "solution"],
    answer_field="solution"
)

env = ReasoningGymEnv(
    gym="arc_1d",
    parser=custom_parser,
    num_train_examples=1000,
)
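With this parser, model responses would be expected to use the matching tags, along the lines of:
<reasoning>
Work through the pattern step by step...
</reasoning>

<solution>
42
</solution>
If you change the fields, consider pairing the parser with a system prompt that tells the model to emit those tags.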

Custom System Prompt

from reasoning_gym.utils import SYSTEM_PROMPTS

env = ReasoningGymEnv(
    gym="arc_1d",
    system_prompt=SYSTEM_PROMPTS["chain_of_thought"],
    num_train_examples=1000,
)
Available system prompts:
  • "default" - Basic reasoning prompt
  • "chain_of_thought" - Encourage step-by-step reasoning
  • "concise" - Encourage brief responses

Scoring

Reasoning Gym tasks have built-in scoring functions. ReasoningGymEnv automatically:
  1. Parses the model’s answer field
  2. Calls the task-specific score_answer() function
  3. Returns a score (typically 0.0 or 1.0)
Format reward (XML compliance) is tracked separately with weight 0.
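Conceptually, the flow looks like the sketch below. This is illustrative pseudocode of the pipeline, not the actual internals, using the parse_answer and score_answer names referenced elsewhere on this page:
def score_rollout(completion: str, entry: dict, parser, dataset) -> float:
    # 1. Extract the <answer> field from the model output
    answer = parser.parse_answer(completion)
    if answer is None:
        return 0.0
    # 2. Delegate to the task-specific Reasoning Gym scorer
    return float(dataset.score_answer(answer, entry))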

Full Example

import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

def load_environment(
    gym: str | list = "arc_1d",
    num_train_examples: int = 2000,
    num_eval_examples: int = 500,
    seed: int = 0,
) -> vf.Environment:
    """Load a Reasoning Gym environment.
    
    Args:
        gym: Dataset name, list of names, or composite spec
        num_train_examples: Number of training examples
        num_eval_examples: Number of eval examples
        seed: Random seed for generation
    """
    return ReasoningGymEnv(
        gym=gym,
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        seed=seed,
    )
With composite datasets:
def load_environment():
    return ReasoningGymEnv(
        gym=[
            {"name": "arc_1d", "weight": 0.4},
            {"name": "elementary_algebra", "weight": 0.3},
            {"name": "propositional_logic", "weight": 0.2},
            {"name": "zebra_puzzle", "weight": 0.1},
        ],
        num_train_examples=2000,
        num_eval_examples=400,
        seed=0,
    )

Expected Format

Models should respond with XML-formatted answers:
<think>
Let me analyze the pattern:
- First element: 1
- Second element: 2
- Third element: 4

This appears to be powers of 2.
</think>

<answer>
8
</answer>
The <answer> field is extracted and passed to the task scorer.
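You can verify extraction locally with the default parser; a small sketch using the XMLParser configuration described above:
import verifiers as vf

parser = vf.XMLParser(fields=["think", "answer"], answer_field="answer")
completion = "<think>Powers of 2.</think>\n<answer>8</answer>"
print(parser.parse_answer(completion))  # -> "8"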

Metrics

  • reward - Task-specific score (0.0 or 1.0)
  • format_reward - XML format compliance (weight 0)

Best Practices

Start with a single dataset to understand task difficulty before mixing multiple datasets.
  • Validate baseline - Test with a strong model first to ensure tasks are solvable
  • Match difficulty - Mix tasks of similar difficulty for stable training
  • Use composite carefully - Large differences in task difficulty can hurt training
  • Set appropriate seeds - Different seeds generate different task variations
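To see how seeds drive variation, generate the same task with two seeds and compare the questions (a quick check using the create_dataset API from the comparison below):
import reasoning_gym as rg

a = rg.create_dataset("arc_1d", size=1, seed=0)
b = rg.create_dataset("arc_1d", size=1, seed=1)
# different seeds produce different procedurally generated tasks
print(next(iter(a))["question"])
print(next(iter(b))["question"])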

Comparison with Raw Reasoning Gym

Using Reasoning Gym directly:
import reasoning_gym as rg

dataset = rg.create_dataset("arc_1d", size=100, seed=0)
for entry in dataset:
    question = entry["question"]
    # ... run model, parse answer
    score = dataset.score_answer(answer, entry)
Using ReasoningGymEnv:
env = ReasoningGymEnv(gym="arc_1d", num_train_examples=100, seed=0)
# Handles dataset creation, parsing, scoring automatically

Examples

See the reasoning-gym-env example in the Verifiers repository.
