
ReasoningGymEnv

Wrapper environment for Reasoning Gym procedural reasoning tasks.

Overview

ReasoningGymEnv wraps Reasoning Gym datasets for use in Verifiers. It supports both single datasets and composite mixtures, automatically handles scoring via Reasoning Gym’s built-in evaluators, and provides procedurally generated tasks. Key features:
  • Procedural task generation via seeds
  • Support for all Reasoning Gym datasets
  • Composite dataset mixing with custom weights
  • Automatic task-specific scoring
  • Built-in XML parser for structured responses

Installation

Install with Reasoning Gym support:
uv add 'verifiers[rg]'
Or when developing in the verifiers repo:
uv sync --extra rg
See the Reasoning Gym integration guide for setup details.

Inheritance

Environment
└── MultiTurnEnv
    └── SingleTurnEnv
        └── ReasoningGymEnv

Constructor

ReasoningGymEnv(
    gym: str | List[str | dict],
    num_train_examples: int = 1000,
    num_eval_examples: int = 100,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    parser: vf.Parser | None = None,
    seed: int = 0,
)

Parameters

gym
str | List[str | dict]
Dataset specification. Can be:
  • String: Single dataset name (e.g., "arc_1d")
  • List of strings: Multiple datasets with equal weights
  • List of dicts: Datasets with custom weights and configs using DatasetSpec format
num_train_examples
int
default: 1000
Number of training examples to generate.
num_eval_examples
int
default: 100
Number of evaluation examples to generate.
system_prompt
str
default: DEFAULT_SYSTEM_PROMPT
System prompt for the model. Defaults to Reasoning Gym’s default prompt.
parser
vf.Parser | None
default: None
Parser for model responses. If None, uses XMLParser(fields=["answer"]).
seed
int
default: 0
Random seed for procedural generation.

Key Methods

build_rg_dataset

def build_rg_dataset(
    gym: str | List[str | dict],
    total_examples: int = 1000,
    seed: int = 0
) -> ProceduralDataset
Construct a Reasoning Gym dataset from the specification. Handles three formats:
  1. String: Single dataset → rg.create_dataset(gym, size=total_examples, seed=seed)
  2. List of strings: Multiple datasets with equal weights (1.0 each)
  3. List of dicts: Datasets with custom DatasetSpec configurations
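The dispatch over these three formats can be sketched as a normalization step that turns any accepted `gym` value into a uniform list of spec dicts (an illustrative helper, not the actual implementation; `normalize_gym_spec` is a hypothetical name):

```python
# Illustrative sketch: normalize the three accepted `gym` formats into a
# uniform list of {name, weight, config} dicts, the shape that would be
# handed to Reasoning Gym's composite-dataset machinery.

def normalize_gym_spec(gym):
    """Normalize a gym spec into a list of {name, weight, config} dicts."""
    if isinstance(gym, str):
        # Format 1: single dataset name
        return [{"name": gym, "weight": 1.0, "config": {}}]
    specs = []
    for item in gym:
        if isinstance(item, str):
            # Format 2: bare name gets an equal weight of 1.0
            specs.append({"name": item, "weight": 1.0, "config": {}})
        else:
            # Format 3: dict with DatasetSpec-style fields; fill in defaults
            specs.append({
                "name": item["name"],
                "weight": float(item.get("weight", 1.0)),
                "config": item.get("config", {}),
            })
    return specs
```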

rg_to_hf

def rg_to_hf(
    rg_dataset: ProceduralDataset
) -> Tuple[Dataset, Dataset]
Convert a Reasoning Gym dataset into HuggingFace train and eval datasets. Each row has:
  • question: Task prompt from Reasoning Gym
  • answer: Index as string (used to retrieve entry for scoring)
  • task: Source dataset name from metadata
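The row shape can be sketched as follows (illustrative only; `entries_to_rows` is a hypothetical helper, and the `source_dataset` metadata key is an assumption about how Reasoning Gym tags composite entries). Note that `answer` stores the entry's index, not the ground-truth answer, so the reward function can look the entry back up in the live procedural dataset:

```python
# Illustrative sketch of the {question, answer, task} row format.

def entries_to_rows(rg_entries):
    """Map Reasoning Gym entries to {question, answer, task} rows."""
    rows = []
    for i, entry in enumerate(rg_entries):
        rows.append({
            "question": entry["question"],
            "answer": str(i),  # index as string, used to retrieve the entry for scoring
            "task": entry.get("metadata", {}).get("source_dataset", ""),
        })
    return rows
```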

Scoring

ReasoningGymEnv automatically creates a rubric with a custom reward function:
async def check_answer_reward_func(
    completion: vf.Messages,
    answer: str,
    **kwargs
) -> float:
    # Get entry by index
    entry = self.rg_dataset[int(answer)]
    # Parse model response
    response = str(parser.parse_answer(completion)).strip()
    # Score using Reasoning Gym's scoring
    reward = self.rg_dataset.score_answer(answer=response, entry=entry)
    return reward
The reward function uses Reasoning Gym’s task-specific scoring, which varies by dataset (exact match, fuzzy match, numeric tolerance, etc.).
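For intuition, here are two simplified scoring styles of the kind a dataset might use (hypothetical illustrations, not Reasoning Gym's actual scorers):

```python
# Hypothetical illustrations of task-specific scoring. Reasoning Gym's
# score_answer is defined per dataset; a string task might require an exact
# match, while a numeric task might allow a small tolerance.

def exact_match_score(response: str, target: str) -> float:
    """1.0 only if the stripped strings match exactly."""
    return 1.0 if response.strip() == target.strip() else 0.0

def numeric_tolerance_score(response: str, target: str, tol: float = 1e-6) -> float:
    """1.0 if the response parses as a number within tol of the target."""
    try:
        return 1.0 if abs(float(response) - float(target)) <= tol else 0.0
    except ValueError:
        return 0.0  # unparseable responses score zero
```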

Example Usage

Single Dataset

import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

def load_environment():
    return ReasoningGymEnv(
        gym="arc_1d",
        num_train_examples=1000,
        num_eval_examples=100,
        seed=0,
    )

Multiple Datasets (Equal Weights)

import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

def load_environment():
    return ReasoningGymEnv(
        gym=["arc_1d", "gsm8k", "math_count"],
        num_train_examples=3000,  # 1000 per dataset
        num_eval_examples=300,    # 100 per dataset
        seed=42,
    )

Composite Dataset (Custom Weights)

import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

def load_environment():
    # Use dict format for DatasetSpec
    return ReasoningGymEnv(
        gym=[
            {"name": "arc_1d", "weight": 2.0, "config": {}},
            {"name": "gsm8k", "weight": 1.0, "config": {}},
            {"name": "math_count", "weight": 0.5, "config": {}},
        ],
        num_train_examples=1000,
        num_eval_examples=100,
        seed=0,
    )

Custom Parser

import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

def load_environment():
    # Use custom parser for chain-of-thought
    parser = vf.XMLParser(
        fields=["reasoning", "answer"],
        answer_field="answer"
    )
    
    return ReasoningGymEnv(
        gym="gsm8k",
        parser=parser,
        system_prompt="Solve the math problem step by step. Use <reasoning> for your work and <answer> for the final numerical answer.",
        num_train_examples=1000,
    )

With Custom System Prompt

import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv
from reasoning_gym.utils import SYSTEM_PROMPTS

def load_environment():
    # Use Reasoning Gym's CoT prompt
    return ReasoningGymEnv(
        gym="arc_1d",
        system_prompt=SYSTEM_PROMPTS["cot"],
        num_train_examples=1000,
    )

Available Datasets

Reasoning Gym provides many procedural datasets. Some popular ones:
  • Pattern Recognition: arc_1d, arc_2d
  • Math: gsm8k, math_count, number_theory
  • Logic: boolean_logic, propositional_logic
  • Sequences: sequence_next, sequence_missing
  • Spatial: grid_navigation, spatial_reasoning
Check the Reasoning Gym repository for the complete list.

DatasetSpec Format

When using composite datasets with custom weights, use this format:
{
    "name": str,        # Dataset name (e.g., "arc_1d")
    "weight": float,    # Sampling weight (default 1.0)
    "config": dict      # Dataset-specific config (default {})
}

Procedural Generation

All tasks are procedurally generated using seeds:
  • Each example gets a unique seed: seed + index
  • Same seed always generates the same task
  • Infinite variations possible
  • Reproducible across runs
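The per-example seeding scheme described above can be sketched as (a minimal illustration; `example_rng` is a hypothetical helper):

```python
import random

# Sketch of per-example seeding: each example's generator is seeded with
# seed + index, so the same (seed, index) pair always yields the same task,
# and different indices yield different variations.

def example_rng(seed: int, index: int) -> random.Random:
    return random.Random(seed + index)

# Same seed + index produces identical draws across runs
a = example_rng(0, 5).randint(0, 10**9)
b = example_rng(0, 5).randint(0, 10**9)
assert a == b
```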
