
GymEnv

A universal adapter for running OpenAI Gym-compatible environments with language models.
GymEnv is experimental and subject to breaking changes. The API may change in future releases.

Overview

GymEnv bridges the gap between Gym’s step-based API and Verifiers’ message-based rollout system. It:
  • Converts Gym observations to text prompts
  • Parses model completions into actions
  • Manages episode lifecycle (reset/step/done)
  • Computes episodic rewards
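Conceptually, the rollout loop looks roughly like the sketch below (the names and structure are illustrative, not GymEnv's actual implementation), shown here against a toy environment:

```python
class CountdownEnv:
    """Toy Gym-style environment: reward 1.0 per step, done after 3 steps."""
    def reset(self, seed: int):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, False, {}

def run_episode(env, generate, obs_to_text, action_parser, max_steps=1000):
    """Illustrative sketch of the adapter loop: obs -> prompt -> completion -> action."""
    obs, _info = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(max_steps):
        prompt = obs_to_text(obs)            # observation -> text prompt
        completion = generate(prompt)        # model produces a completion
        action = action_parser(completion)   # completion -> Gym action
        obs, reward, done, truncated, _info = env.step(action)
        total_reward += reward
        if done or truncated:
            break
    return total_reward

reward = run_episode(CountdownEnv(), generate=lambda p: "noop",
                     obs_to_text=str, action_parser=lambda s: s)
```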

Inheritance

Environment
└── MultiTurnEnv
    └── GymEnv

Constructor

GymEnv(
    env_cls: type[StepResetEnv],
    env_kwargs: dict[str, Any] | None = None,
    action_parser: Callable[[str], Any] | None = None,
    obs_to_text: Callable[[Any], str] | None = None,
    num_train_episodes: int = 1000,
    num_eval_episodes: int = 20,
    max_episode_steps: int | None = None,
    seed: int = 0,
    system_prompt: str | None = None,
    few_shot: list[dict[str, Any]] | None = None,
    parser: vf.Parser | None = None,
    rubric: Rubric | None = None,
    message_type: MessageType = "chat",
)
env_cls: type[StepResetEnv] (required)
    Gym environment class with reset(seed) and step(action) methods.
env_kwargs: dict[str, Any] | None (default: None)
    Keyword arguments passed to the env_cls() constructor.
action_parser: Callable[[str], Any] | None (default: None)
    Function to parse model output into an action. Defaults to identity (string actions).
obs_to_text: Callable[[Any], str] | None (default: None)
    Function to convert observations to text. Defaults to str(obs).
num_train_episodes: int (default: 1000)
    Number of episodes in the training dataset.
num_eval_episodes: int (default: 20)
    Number of episodes in the eval dataset.
max_episode_steps: int | None (default: None)
    Maximum steps per episode. If None, uses 1000.
seed: int (default: 0)
    Random seed for episode generation.
system_prompt: str | None (default: None)
    System prompt explaining the task.
rubric: Rubric | None (default: None)
    Custom rubric for scoring. Defaults to EpisodicSumRubric().

StepResetEnv Protocol

Gym environments must implement:
from typing import Protocol

class StepResetEnv(Protocol):
    def reset(self, seed: int):
        """Reset environment with seed. Returns obs or (obs, info)."""
        ...

    def step(self, action):
        """Take action. Returns (obs, reward, done, info) or (obs, reward, done, truncated, info)."""
        ...
Supports both old (4-tuple) and new (5-tuple) Gym APIs.
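One plausible way to normalize the two step signatures (a sketch for illustration, not GymEnv's actual code):

```python
def normalize_step(result):
    """Coerce a Gym step() result into (obs, reward, done, truncated, info)."""
    if len(result) == 5:
        # New Gymnasium API: obs, reward, terminated, truncated, info
        obs, reward, done, truncated, info = result
    else:
        # Old Gym API: obs, reward, done, info (no truncated flag)
        obs, reward, done, info = result
        truncated = False
    return obs, reward, done, truncated, info
```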

Example Usage

CartPole Example

import verifiers as vf
import gymnasium as gym

def load_environment():
    def action_parser(text: str) -> int:
        """Parse '0' or '1' from model output."""
        text = text.strip().lower()
        if '0' in text:
            return 0
        elif '1' in text:
            return 1
        else:
            raise ValueError(f"Invalid action: {text}")
    
    def obs_to_text(obs) -> str:
        """Format CartPole observation."""
        cart_pos, cart_vel, pole_angle, pole_vel = obs
        return f"""CartPole State:
- Cart position: {cart_pos:.3f}
- Cart velocity: {cart_vel:.3f}
- Pole angle: {pole_angle:.3f}
- Pole velocity: {pole_vel:.3f}

Choose action (0=left, 1=right):"""
    
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        action_parser=action_parser,
        obs_to_text=obs_to_text,
        num_train_episodes=100,
        num_eval_episodes=10,
        max_episode_steps=500,
        system_prompt="You are controlling a CartPole. Balance the pole by moving left (0) or right (1).",
    )

# Run evaluation (evaluate is async, so wrap it in an event loop)
import asyncio

async def main():
    env = load_environment()
    results = await env.evaluate(
        client=vf.ClientConfig(api_key="..."),
        model="gpt-4",
        num_examples=10,
    )
    print(f"Average episode reward: {results['metadata']['avg_reward']}")

asyncio.run(main())

Custom Text Game

import verifiers as vf

class TextAdventure:
    """Simple text-based game."""
    
    def __init__(self):
        self.location = "start"
        self.inventory = []
        self.steps = 0
    
    def reset(self, seed: int):
        self.location = "start"
        self.inventory = []
        self.steps = 0
        return "You are in a dark room. Exits: north, south", {}
    
    def step(self, action: str):
        self.steps += 1
        
        action = action.lower().strip()
        
        if action == "north":
            self.location = "treasure_room"
            obs = "You found treasure! You win!"
            reward = 1.0
            done = True
        elif action == "south":
            obs = "You fell into a pit. Game over."
            reward = 0.0
            done = True
        else:
            obs = f"Invalid action '{action}'. Try: north, south"
            reward = 0.0
            done = False
        
        truncated = self.steps >= 10
        return obs, reward, done, truncated, {}

def load_environment():
    return vf.GymEnv(
        env_cls=TextAdventure,
        num_train_episodes=50,
        num_eval_episodes=10,
        max_episode_steps=10,
        system_prompt="Navigate the dungeon by typing commands.",
    )

With Custom Parser

import verifiers as vf
import gymnasium as gym
import re

def load_environment():
    def action_parser(text: str) -> int:
        """Extract numeric action from verbose output."""
        # Model might say "I choose action 2"
        match = re.search(r'\b([0-3])\b', text)
        if match:
            return int(match.group(1))
        raise ValueError(f"No valid action in: {text}")
    
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "LunarLander-v2"},
        action_parser=action_parser,
        num_train_episodes=100,
        max_episode_steps=1000,
    )
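The regex parser above can be exercised on its own before wiring it into GymEnv:

```python
import re

def action_parser(text: str) -> int:
    """Extract a single digit 0-3 from verbose model output."""
    match = re.search(r'\b([0-3])\b', text)
    if match:
        return int(match.group(1))
    raise ValueError(f"No valid action in: {text}")

action_parser("I choose action 2")  # returns 2
```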

Built-in Rubric

EpisodicSumRubric

Default rubric that sums step rewards:
class EpisodicSumRubric(Rubric):
    def __init__(self, weight: float = 1.0, **kwargs):
        super().__init__(funcs=[sum_step_rewards], weights=[weight], **kwargs)
Accesses state["trajectory"] to sum per-step rewards from the environment.
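The sum_step_rewards reward function presumably reduces to something like the following (a sketch based on the description above, not the library source):

```python
def sum_step_rewards(state: dict, **kwargs) -> float:
    """Sum the per-step rewards recorded in state['trajectory']."""
    return sum(step.get("reward", 0.0) for step in state.get("trajectory", []))

state = {"trajectory": [{"reward": 1.0}, {"reward": 0.5}, {}]}
```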

Key Methods

gym_to_hf

def gym_to_hf(self) -> tuple[Dataset, Dataset | None]
Generates HuggingFace datasets by running reset() on each episode:
# Each row:
{
    "question": obs_to_text(initial_obs),
    "answer": str(seed),  # Stored for reproducibility
}
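The row generation can be pictured roughly like this (an illustrative sketch with a toy environment, not the actual gym_to_hf code):

```python
def build_rows(env_cls, obs_to_text, seeds):
    """Sketch: reset each seed, render the initial observation as the question."""
    rows = []
    for seed in seeds:
        env = env_cls()
        result = env.reset(seed=seed)
        obs = result[0] if isinstance(result, tuple) else result
        rows.append({"question": obs_to_text(obs), "answer": str(seed)})
    return rows

class Toy:
    def reset(self, seed: int):
        return seed * 10, {}

rows = build_rows(Toy, str, seeds=[0, 1])
```

Storing the seed as the answer lets a rollout later call reset(seed=int(row["answer"])) to reproduce the same episode.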

obs_to_text

def obs_to_text(self, obs: Any) -> str
Converts observation to text. Override for custom formatting:
class CustomGymEnv(vf.GymEnv):
    def obs_to_text(self, obs):
        # Custom observation rendering
        return f"Custom format: {obs}"

env_response

async def env_response(
    self,
    messages: vf.Messages,
    state: State,
    **kwargs
) -> vf.Messages | str
Executes env.step(action) and returns observation as user message.

State Keys

GymEnv adds:
gym_env
StepResetEnv
Active Gym environment instance (created per rollout).
gym_done
bool
Whether episode has terminated.
trajectory[i]['reward']
float
Step-level reward from env.step().
trajectory[i]['extras']['gym_info']
dict
Info dict returned by env.step().

Error Handling

Action parsing errors:
  • Set state["gym_done"] = True
  • Return error message to model
  • Assign 0.0 reward to that step
# If action_parser raises:
"Action Parsing Error: Invalid action: 'foo'"

Stop Conditions

Episode ends when:
  1. Gym returns done=True or truncated=True
  2. max_episode_steps is reached
  3. Action parsing fails
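Since parsing failures also set state["gym_done"] (see Error Handling), the stop check can be sketched as a single predicate over the state (illustrative, not the actual implementation):

```python
def episode_done(state: dict, max_episode_steps: int) -> bool:
    """Sketch: stop on gym termination/truncation/parse failure, or on the step cap."""
    return state.get("gym_done", False) or len(state["trajectory"]) >= max_episode_steps
```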

Advanced: Custom Reward

import verifiers as vf
import gymnasium as gym

def custom_reward(state: vf.State) -> float:
    """Bonus for efficiency."""
    total_reward = sum(
        step.get("reward", 0.0) for step in state["trajectory"]
    )
    num_steps = len(state["trajectory"])
    efficiency_bonus = max(0, 1.0 - num_steps / 100)
    return total_reward + efficiency_bonus

def load_environment():
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        rubric=vf.Rubric(funcs=[custom_reward]),
    )

Limitations

  • Text-only: Observation must be convertible to text (no vision support)
  • Synchronous: Gym envs are not async
  • Single episode per rollout: Each rollout is one episode

When to Use

Use GymEnv for:
  • Existing Gym environments
  • Sequential decision-making tasks
  • Reinforcement learning benchmarks
  • Text-based games
For tool-based tasks, use ToolEnv instead.
