
GymEnv

A universal adapter for running OpenAI Gym-compatible environments with language models.
GymEnv is experimental and subject to breaking changes. The API may change in future releases.

Overview

GymEnv bridges the gap between Gym’s step-based API and Verifiers’ message-based rollout system. It:
  • Converts Gym observations to text prompts
  • Parses model completions into actions
  • Manages episode lifecycle (reset/step/done)
  • Computes episodic rewards
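Conceptually, the rollout loop looks roughly like the sketch below (the names and structure are illustrative, not GymEnv's actual implementation), shown here against a toy environment:

```python
class CountdownEnv:
    """Toy Gym-style environment: reward 1.0 per step, done after 3 steps."""
    def reset(self, seed: int):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, False, {}

def run_episode(env, generate, obs_to_text, action_parser, max_steps=1000):
    """Illustrative sketch of the adapter loop: obs -> prompt -> completion -> action."""
    obs, _info = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(max_steps):
        prompt = obs_to_text(obs)            # observation -> text prompt
        completion = generate(prompt)        # model produces a completion
        action = action_parser(completion)   # completion -> Gym action
        obs, reward, done, truncated, _info = env.step(action)
        total_reward += reward
        if done or truncated:
            break
    return total_reward

reward = run_episode(CountdownEnv(), generate=lambda p: "noop",
                     obs_to_text=str, action_parser=lambda s: s)
```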

Inheritance

Environment
└── MultiTurnEnv
    └── GymEnv

Constructor

GymEnv(
    env_cls: type[StepResetEnv],
    env_kwargs: dict[str, Any] | None = None,
    action_parser: Callable[[str], Any] | None = None,
    obs_to_text: Callable[[Any], str] | None = None,
    num_train_episodes: int = 1000,
    num_eval_episodes: int = 20,
    max_episode_steps: int | None = None,
    seed: int = 0,
    system_prompt: str | None = None,
    few_shot: list[dict[str, Any]] | None = None,
    parser: vf.Parser | None = None,
    rubric: Rubric | None = None,
    message_type: MessageType = "chat",
)
env_cls: type[StepResetEnv] (required)
    Gym environment class with reset(seed) and step(action) methods.
env_kwargs: dict[str, Any] | None (default: None)
    Keyword arguments passed to the env_cls() constructor.
action_parser: Callable[[str], Any] | None (default: None)
    Function to parse model output into an action. Defaults to identity (string actions).
obs_to_text: Callable[[Any], str] | None (default: None)
    Function to convert observations to text. Defaults to str(obs).
num_train_episodes: int (default: 1000)
    Number of episodes in the training dataset.
num_eval_episodes: int (default: 20)
    Number of episodes in the eval dataset.
max_episode_steps: int | None (default: None)
    Maximum steps per episode. If None, uses 1000.
seed: int (default: 0)
    Random seed for episode generation.
system_prompt: str | None (default: None)
    System prompt explaining the task.
rubric: Rubric | None (default: None)
    Custom rubric for scoring. Defaults to EpisodicSumRubric().

StepResetEnv Protocol

Gym environments must implement:
from typing import Protocol

class StepResetEnv(Protocol):
    def reset(self, seed: int):
        """Reset environment with seed. Returns obs or (obs, info)."""
        ...

    def step(self, action):
        """Take action. Returns (obs, reward, done, info) or (obs, reward, done, truncated, info)."""
        ...
Supports both old (4-tuple) and new (5-tuple) Gym APIs.
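One plausible way to normalize the two step signatures (a sketch for illustration, not GymEnv's actual code):

```python
def normalize_step(result):
    """Coerce a Gym step() result into (obs, reward, done, truncated, info)."""
    if len(result) == 5:
        # New Gymnasium API: obs, reward, terminated, truncated, info
        obs, reward, done, truncated, info = result
    else:
        # Old Gym API: obs, reward, done, info (no truncated flag)
        obs, reward, done, info = result
        truncated = False
    return obs, reward, done, truncated, info
```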

Example Usage

CartPole Example

import verifiers as vf
import gymnasium as gym

def load_environment():
    def action_parser(text: str) -> int:
        """Parse '0' or '1' from model output."""
        text = text.strip().lower()
        if '0' in text:
            return 0
        elif '1' in text:
            return 1
        else:
            raise ValueError(f"Invalid action: {text}")
    
    def obs_to_text(obs) -> str:
        """Format CartPole observation."""
        cart_pos, cart_vel, pole_angle, pole_vel = obs
        return f"""CartPole State:
- Cart position: {cart_pos:.3f}
- Cart velocity: {cart_vel:.3f}
- Pole angle: {pole_angle:.3f}
- Pole velocity: {pole_vel:.3f}

Choose action (0=left, 1=right):"""
    
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        action_parser=action_parser,
        obs_to_text=obs_to_text,
        num_train_episodes=100,
        num_eval_episodes=10,
        max_episode_steps=500,
        system_prompt="You are controlling a CartPole. Balance the pole by moving left (0) or right (1).",
    )

# Run evaluation (evaluate is async, so wrap it in an event loop)
import asyncio

async def main():
    env = load_environment()
    results = await env.evaluate(
        client=vf.ClientConfig(api_key="..."),
        model="gpt-4",
        num_examples=10,
    )
    print(f"Average episode reward: {results['metadata']['avg_reward']}")

asyncio.run(main())

Custom Text Game

import verifiers as vf

class TextAdventure:
    """Simple text-based game."""
    
    def __init__(self):
        self.location = "start"
        self.inventory = []
        self.steps = 0
    
    def reset(self, seed: int):
        self.location = "start"
        self.inventory = []
        self.steps = 0
        return "You are in a dark room. Exits: north, south", {}
    
    def step(self, action: str):
        self.steps += 1
        
        action = action.lower().strip()
        
        if action == "north":
            self.location = "treasure_room"
            obs = "You found treasure! You win!"
            reward = 1.0
            done = True
        elif action == "south":
            obs = "You fell into a pit. Game over."
            reward = 0.0
            done = True
        else:
            obs = f"Invalid action '{action}'. Try: north, south"
            reward = 0.0
            done = False
        
        truncated = self.steps >= 10
        return obs, reward, done, truncated, {}

def load_environment():
    return vf.GymEnv(
        env_cls=TextAdventure,
        num_train_episodes=50,
        num_eval_episodes=10,
        max_episode_steps=10,
        system_prompt="Navigate the dungeon by typing commands.",
    )

With Custom Parser

import verifiers as vf
import gymnasium as gym
import re

def load_environment():
    def action_parser(text: str) -> int:
        """Extract numeric action from verbose output."""
        # Model might say "I choose action 2"
        match = re.search(r'\b([0-3])\b', text)
        if match:
            return int(match.group(1))
        raise ValueError(f"No valid action in: {text}")
    
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "LunarLander-v2"},
        action_parser=action_parser,
        num_train_episodes=100,
        max_episode_steps=1000,
    )
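The regex parser above can be exercised on its own before wiring it into GymEnv:

```python
import re

def action_parser(text: str) -> int:
    """Extract a single digit 0-3 from verbose model output."""
    match = re.search(r'\b([0-3])\b', text)
    if match:
        return int(match.group(1))
    raise ValueError(f"No valid action in: {text}")

action_parser("I choose action 2")  # returns 2
```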

Built-in Rubric

EpisodicSumRubric

Default rubric that sums step rewards:
class EpisodicSumRubric(Rubric):
    def __init__(self, weight: float = 1.0, **kwargs):
        super().__init__(funcs=[sum_step_rewards], weights=[weight], **kwargs)
Accesses state["trajectory"] to sum per-step rewards from the environment.
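The sum_step_rewards reward function presumably reduces to something like the following (a sketch based on the description above, not the library source):

```python
def sum_step_rewards(state: dict, **kwargs) -> float:
    """Sum the per-step rewards recorded in state['trajectory']."""
    return sum(step.get("reward", 0.0) for step in state.get("trajectory", []))

state = {"trajectory": [{"reward": 1.0}, {"reward": 0.5}, {}]}
```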

Key Methods

gym_to_hf

def gym_to_hf(self) -> tuple[Dataset, Dataset | None]
Generates HuggingFace datasets by running reset() on each episode:
# Each row:
{
    "question": obs_to_text(initial_obs),
    "answer": str(seed),  # Stored for reproducibility
}
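The row generation can be pictured roughly like this (an illustrative sketch with a toy environment, not the actual gym_to_hf code):

```python
def build_rows(env_cls, obs_to_text, seeds):
    """Sketch: reset each seed, render the initial observation as the question."""
    rows = []
    for seed in seeds:
        env = env_cls()
        result = env.reset(seed=seed)
        obs = result[0] if isinstance(result, tuple) else result
        rows.append({"question": obs_to_text(obs), "answer": str(seed)})
    return rows

class Toy:
    def reset(self, seed: int):
        return seed * 10, {}

rows = build_rows(Toy, str, seeds=[0, 1])
```

Storing the seed as the answer lets a rollout later call reset(seed=int(row["answer"])) to reproduce the same episode.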

obs_to_text

def obs_to_text(self, obs: Any) -> str
Converts observation to text. Override for custom formatting:
class CustomGymEnv(vf.GymEnv):
    def obs_to_text(self, obs):
        # Custom observation rendering
        return f"Custom format: {obs}"

env_response

async def env_response(
    self,
    messages: vf.Messages,
    state: State,
    **kwargs
) -> vf.Messages | str
Executes env.step(action) and returns observation as user message.

State Keys

GymEnv adds:
gym_env
StepResetEnv
Active Gym environment instance (created per rollout).
gym_done
bool
Whether episode has terminated.
trajectory[i]['reward']
float
Step-level reward from env.step().
trajectory[i]['extras']['gym_info']
dict
Info dict returned by env.step().

Error Handling

Action parsing errors:
  • Set state["gym_done"] = True
  • Return error message to model
  • Assign 0.0 reward to that step
# If action_parser raises:
"Action Parsing Error: Invalid action: 'foo'"

Stop Conditions

Episode ends when:
  1. Gym returns done=True or truncated=True
  2. max_episode_steps is reached
  3. Action parsing fails
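Since parsing failures also set state["gym_done"] (see Error Handling), the stop check can be sketched as a single predicate over the state (illustrative, not the actual implementation):

```python
def episode_done(state: dict, max_episode_steps: int) -> bool:
    """Sketch: stop on gym termination/truncation/parse failure, or on the step cap."""
    return state.get("gym_done", False) or len(state["trajectory"]) >= max_episode_steps
```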

Advanced: Custom Reward

import verifiers as vf
import gymnasium as gym

def custom_reward(state: vf.State) -> float:
    """Bonus for efficiency."""
    total_reward = sum(
        step.get("reward", 0.0) for step in state["trajectory"]
    )
    num_steps = len(state["trajectory"])
    efficiency_bonus = max(0, 1.0 - num_steps / 100)
    return total_reward + efficiency_bonus

def load_environment():
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        rubric=vf.Rubric(funcs=[custom_reward]),
    )

Limitations

  • Text-only: Observation must be convertible to text (no vision support)
  • Synchronous: Gym envs are not async
  • Single episode per rollout: Each rollout is one episode

When to Use

Use GymEnv for:
  • Existing Gym environments
  • Sequential decision-making tasks
  • Reinforcement learning benchmarks
  • Text-based games
For tool-based tasks, use ToolEnv instead.
