GymEnv
A universal adapter for running OpenAI Gym-compatible environments with language models.
GymEnv is experimental: the API is subject to breaking changes in future releases.
Overview
GymEnv bridges the gap between Gym’s step-based API and Verifiers’ message-based rollout system. It:
- Converts Gym observations to text prompts
- Parses model completions into actions
- Manages episode lifecycle (reset/step/done)
- Computes episodic rewards
Inheritance
Environment
└── MultiTurnEnv
└── GymEnv
Constructor
GymEnv(
    env_cls: type[StepResetEnv],
    env_kwargs: dict[str, Any] | None = None,
    action_parser: Callable[[str], Any] | None = None,
    obs_to_text: Callable[[Any], str] | None = None,
    num_train_episodes: int = 1000,
    num_eval_episodes: int = 20,
    max_episode_steps: int | None = None,
    seed: int = 0,
    system_prompt: str | None = None,
    few_shot: list[dict[str, Any]] | None = None,
    parser: vf.Parser | None = None,
    rubric: Rubric | None = None,
    message_type: MessageType = "chat",
)
env_cls
type[StepResetEnv]
required
Gym environment class with reset(seed) and step(action) methods.
env_kwargs
dict[str, Any] | None
default:"None"
Keyword arguments passed to env_cls() constructor.
action_parser
Callable[[str], Any] | None
default:"None"
Function to parse model output into an action. Defaults to identity (string actions).
obs_to_text
Callable[[Any], str] | None
default:"None"
Function to convert observations to text. Defaults to str(obs).
num_train_episodes
int
default:"1000"
Number of episodes in the training dataset.
num_eval_episodes
int
default:"20"
Number of episodes in the eval dataset.
max_episode_steps
int | None
default:"None"
Maximum steps per episode. If None, uses 1000.
seed
int
default:"0"
Random seed for episode generation.
system_prompt
str | None
default:"None"
System prompt explaining the task.
rubric
Rubric | None
default:"None"
Custom rubric for scoring. Defaults to EpisodicSumRubric().
StepResetEnv Protocol
Gym environments must implement:
class StepResetEnv(Protocol):
    def reset(self, seed: int) -> obs | tuple[obs, dict]:
        """Reset environment with seed."""
        ...

    def step(self, action) -> tuple[obs, reward, done, info] | tuple[obs, reward, done, truncated, info]:
        """Take action and return new state."""
        ...
Supports both old (4-tuple) and new (5-tuple) Gym APIs.
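The two shapes can be normalized with a small helper; the sketch below is illustrative (the `normalize_step` name is not part of the verifiers API):

```python
def normalize_step(result):
    """Normalize a Gym step() result to (obs, reward, done, truncated, info).

    Old 4-tuple API: (obs, reward, done, info) -- no separate truncated flag.
    New 5-tuple API: (obs, reward, terminated, truncated, info).
    """
    if len(result) == 4:
        obs, reward, done, info = result
        return obs, reward, done, False, info
    obs, reward, terminated, truncated, info = result
    return obs, reward, terminated, truncated, info

# Both APIs map onto the same 5-tuple:
print(normalize_step(("obs", 1.0, False, {})))        # old-style result
print(normalize_step(("obs", 1.0, False, True, {})))  # new-style result
```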
Example Usage
CartPole Example
import verifiers as vf
import gymnasium as gym

def load_environment():
    def action_parser(text: str) -> int:
        """Parse '0' or '1' from model output."""
        text = text.strip().lower()
        if '0' in text:
            return 0
        elif '1' in text:
            return 1
        else:
            raise ValueError(f"Invalid action: {text}")

    def obs_to_text(obs) -> str:
        """Format CartPole observation."""
        cart_pos, cart_vel, pole_angle, pole_vel = obs
        return f"""CartPole State:
- Cart position: {cart_pos:.3f}
- Cart velocity: {cart_vel:.3f}
- Pole angle: {pole_angle:.3f}
- Pole velocity: {pole_vel:.3f}
Choose action (0=left, 1=right):"""

    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        action_parser=action_parser,
        obs_to_text=obs_to_text,
        num_train_episodes=100,
        num_eval_episodes=10,
        max_episode_steps=500,
        system_prompt="You are controlling a CartPole. Balance the pole by moving left (0) or right (1).",
    )

# Run evaluation (from within an async context)
env = load_environment()
results = await env.evaluate(
    client=vf.ClientConfig(api_key="..."),
    model="gpt-4",
    num_examples=10,
)
print(f"Average episode reward: {results['metadata']['avg_reward']}")
Custom Text Game
import verifiers as vf

class TextAdventure:
    """Simple text-based game."""

    def __init__(self):
        self.location = "start"
        self.inventory = []
        self.steps = 0

    def reset(self, seed: int):
        self.location = "start"
        self.inventory = []
        self.steps = 0
        return "You are in a dark room. Exits: north, south", {}

    def step(self, action: str):
        self.steps += 1
        action = action.lower().strip()
        if action == "north":
            self.location = "treasure_room"
            obs = "You found treasure! You win!"
            reward = 1.0
            done = True
        elif action == "south":
            obs = "You fell into a pit. Game over."
            reward = 0.0
            done = True
        else:
            obs = f"Invalid action '{action}'. Try: north, south"
            reward = 0.0
            done = False
        truncated = self.steps >= 10
        return obs, reward, done, truncated, {}

def load_environment():
    return vf.GymEnv(
        env_cls=TextAdventure,
        num_train_episodes=50,
        num_eval_episodes=10,
        max_episode_steps=10,
        system_prompt="Navigate the dungeon by typing commands.",
    )
With Custom Parser
import verifiers as vf
import gymnasium as gym
import re

def load_environment():
    def action_parser(text: str) -> int:
        """Extract numeric action from verbose output."""
        # Model might say "I choose action 2"
        match = re.search(r'\b([0-3])\b', text)
        if match:
            return int(match.group(1))
        raise ValueError(f"No valid action in: {text}")

    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "LunarLander-v2"},
        action_parser=action_parser,
        num_train_episodes=100,
        max_episode_steps=1000,
    )
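Because the parser is a plain function, it can be sanity-checked in isolation on representative model outputs before wiring it into GymEnv:

```python
import re

def action_parser(text: str) -> int:
    """Extract a discrete action 0-3 from verbose model output."""
    match = re.search(r'\b([0-3])\b', text)
    if match:
        return int(match.group(1))
    raise ValueError(f"No valid action in: {text}")

print(action_parser("I choose action 2"))  # 2
print(action_parser("3"))                  # 3
try:
    action_parser("fire the main engine")
except ValueError as err:
    print(err)  # No valid action in: fire the main engine
```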
Built-in Rubric
EpisodicSumRubric
Default rubric that sums step rewards:
class EpisodicSumRubric(Rubric):
    def __init__(self, weight: float = 1.0, **kwargs):
        super().__init__(funcs=[sum_step_rewards], weights=[weight], **kwargs)
Accesses state["trajectory"] to sum per-step rewards from the environment.
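In spirit, `sum_step_rewards` is a one-liner over the trajectory; the sketch below assumes each trajectory entry is a dict with a `reward` key (consistent with the custom-reward example later on this page), and is not the exact library implementation:

```python
def sum_step_rewards(state) -> float:
    """Sum per-step environment rewards recorded in state["trajectory"]."""
    return sum(step.get("reward", 0.0) for step in state["trajectory"])

state = {"trajectory": [{"reward": 1.0}, {"reward": 1.0}, {"reward": 0.5}]}
print(sum_step_rewards(state))  # 2.5
```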
Key Methods
gym_to_hf
def gym_to_hf(self) -> tuple[Dataset, Dataset | None]
Generates HuggingFace datasets by running reset() on each episode:
# Each row:
{
    "question": obs_to_text(initial_obs),
    "answer": str(seed),  # Stored for reproducibility
}
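In outline, generation amounts to one reset per seed; the helper below is a hypothetical sketch of that loop, not the actual `gym_to_hf` implementation:

```python
def build_rows(env_cls, obs_to_text, num_episodes, base_seed=0):
    """Build one dataset row per episode seed from the initial observation."""
    rows = []
    for i in range(num_episodes):
        seed = base_seed + i
        env = env_cls()
        result = env.reset(seed)
        # Heuristic: reset() may return obs alone or an (obs, info) tuple
        obs = result[0] if isinstance(result, tuple) else result
        rows.append({"question": obs_to_text(obs), "answer": str(seed)})
    return rows
```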
obs_to_text
def obs_to_text(self, obs: Any) -> str
Converts observation to text. Override for custom formatting:
class CustomGymEnv(vf.GymEnv):
    def obs_to_text(self, obs):
        # Custom observation rendering
        return f"Custom format: {obs}"
env_response
async def env_response(
    self,
    messages: vf.Messages,
    state: State,
    **kwargs,
) -> vf.Messages | str
Executes env.step(action) and returns observation as user message.
State Keys
GymEnv adds the following to state:
- The active Gym environment instance (created per rollout)
- Whether the episode has terminated (exposed as state["gym_done"])
- The step-level reward from env.step()
- The info dict returned by env.step()
Error Handling
When the action parser raises, GymEnv:
- Sets state["gym_done"] = True
- Returns an error message to the model
- Assigns 0.0 reward to that step

# If action_parser raises:
"Action Parsing Error: Invalid action: 'foo'"
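If you would rather fall back to a default action than end the episode on a parse failure, wrap the parser yourself; GymEnv does not provide this wrapper, so the sketch below is a pattern, not library API:

```python
def with_fallback(parser, default):
    """Wrap an action parser so parse failures return a default action."""
    def safe_parser(text):
        try:
            return parser(text)
        except ValueError:
            return default
    return safe_parser

parse = with_fallback(lambda text: int(text), default=0)
print(parse("3"))     # 3
print(parse("left"))  # 0 (fallback)
```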
Stop Conditions
Episode ends when:
- Gym returns done=True or truncated=True
- max_episode_steps is reached
- Action parsing fails
Advanced: Custom Reward
import verifiers as vf
import gymnasium as gym

def custom_reward(state: vf.State) -> float:
    """Bonus for efficiency."""
    total_reward = sum(
        step.get("reward", 0.0) for step in state["trajectory"]
    )
    num_steps = len(state["trajectory"])
    efficiency_bonus = max(0, 1.0 - num_steps / 100)
    return total_reward + efficiency_bonus

def load_environment():
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        rubric=vf.Rubric(funcs=[custom_reward]),
    )
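Since `custom_reward` is plain Python over the state dict, it can be checked directly with a handcrafted state (the `vf.State` annotation is dropped here so the snippet runs without verifiers installed):

```python
def custom_reward(state) -> float:
    """Episode return plus a small bonus for finishing in fewer steps."""
    total_reward = sum(step.get("reward", 0.0) for step in state["trajectory"])
    num_steps = len(state["trajectory"])
    efficiency_bonus = max(0, 1.0 - num_steps / 100)
    return total_reward + efficiency_bonus

state = {"trajectory": [{"reward": 1.0}] * 20}  # 20 steps, reward 1.0 each
print(custom_reward(state))  # 20.0 return plus 0.8 efficiency bonus
```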
Limitations
- Text-only: Observation must be convertible to text (no vision support)
- Synchronous: Gym envs are not async
- Single episode per rollout: Each rollout is one episode
When to Use
Use GymEnv for:
- Existing Gym environments
- Sequential decision-making tasks
- Reinforcement learning benchmarks
- Text-based games
For tool-based tasks, use ToolEnv instead.
See Also