EnvGroup
Environment that combines multiple environments into a single mixture, routing rollouts to the appropriate environment based on the task field.
Overview
EnvGroup enables:
- Training on multiple tasks: Combine different environments into one training dataset
- Task-based routing: Each rollout is routed to the correct environment based on the task field
- Unified metrics: Aggregate metrics across all environments
- Shared configuration: Apply settings to all sub-environments at once
Inheritance
EnvGroup extends Environment.
Constructor
EnvGroup(
envs: list[vf.Environment],
env_names: list[str] | None = None,
map_kwargs: dict = {},
**kwargs
)
Parameters
envs
list[vf.Environment]
required
List of environment instances to combine. Must contain at least one environment.
env_names
list[str] | None
Optional names for each environment, used for task routing. If not provided, uses "env_0", "env_1", etc.
map_kwargs
dict
Keyword arguments passed to HuggingFace dataset .map() operations.
All other parameters are inherited from Environment.
Behavior
Dataset Concatenation
EnvGroup concatenates the datasets from all sub-environments:
- Automatically builds datasets from each environment
- Overrides the task column to use env_names for routing
- Ensures unique example_id across all examples
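The concatenation behavior can be sketched with plain lists of dicts (an illustration of the documented behavior, not verifiers' actual implementation, which operates on HuggingFace datasets):

```python
# Sketch: override each row's task with its env_name and assign a
# unique example_id across the combined dataset.
def combine(datasets, env_names):
    combined = []
    next_id = 0
    for rows, name in zip(datasets, env_names):
        for row in rows:
            new_row = dict(row)
            new_row["task"] = name           # overridden for routing
            new_row["example_id"] = next_id  # unique across all examples
            next_id += 1
            combined.append(new_row)
    return combined

qa = [{"question": "Q1"}, {"question": "Q2"}]
math = [{"question": "M1"}]
rows = combine([qa, math], ["qa", "math"])
print([(r["example_id"], r["task"]) for r in rows])
# → [(0, 'qa'), (1, 'qa'), (2, 'math')]
```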
Task Routing
Each rollout is routed based on the task field in the input:
env = EnvGroup(
envs=[env_a, env_b],
env_names=["task_a", "task_b"]
)
# Input with task="task_a" → routed to env_a
# Input with task="task_b" → routed to env_b
Metric Aggregation
All environments’ reward functions are tracked:
- If an environment doesn’t have a metric, it gets 0.0 for that metric
- All states include all metric names across all environments
- Enables fair comparison across different task types
Core Methods
rollout
async def rollout(
input: RolloutInput,
client: Client,
model: str,
sampling_args: SamplingArgs | None = None
) -> vf.State
Routes to the appropriate environment based on input["task"].
get_env_for_task
def get_env_for_task(task: str) -> vf.Environment
Get the environment instance for a given task name.
task (str): Task identifier from the dataset.
Returns: vf.Environment - the environment registered for that task, or the first environment if the task is not found.
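The routing and fallback behavior can be sketched as a name-to-environment lookup (a minimal stand-in, not verifiers' code; the string environments below are placeholders for vf.Environment instances):

```python
# Sketch of get_env_for_task: known tasks route by name, unknown
# tasks fall back to the first environment.
def get_env_for_task(task, envs, env_names):
    mapping = dict(zip(env_names, envs))
    return mapping.get(task, envs[0])  # unknown task → first environment

envs = ["env_a", "env_b"]        # placeholders for environment instances
names = ["task_a", "task_b"]
print(get_env_for_task("task_b", envs, names))   # → env_b
print(get_env_for_task("unknown", envs, names))  # → env_a (fallback)
```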
set_max_seq_len
def set_max_seq_len(max_seq_len: int | None) -> None
Set max sequence length for the group and all sub-environments.
set_score_rollouts
def set_score_rollouts(score_rollouts: bool) -> None
Set score_rollouts flag for the group and all sub-environments.
Example Usage
Combining Q&A and Math
import verifiers as vf
from datasets import load_dataset
def load_environment():
# QA environment
qa_dataset = load_dataset("squad", split="train[:100]")
qa_env = vf.SingleTurnEnv(
dataset=qa_dataset,
rubric=vf.Rubric(lambda answer, completion: 1.0 if answer in str(completion) else 0.0),
system_prompt="Answer the question based on the context."
)
# Math environment
math_dataset = load_dataset("gsm8k", "main", split="train[:100]")
math_env = vf.SingleTurnEnv(
dataset=math_dataset,
rubric=vf.Rubric(lambda answer, completion: 1.0 if answer in str(completion) else 0.0),
system_prompt="Solve the math problem."
)
# Combine into mixture
return vf.EnvGroup(
envs=[qa_env, math_env],
env_names=["qa", "math"]
)
# Usage
env = load_environment()
results = await env.evaluate(
client=vf.ClientConfig(provider="openai", api_key="sk-..."),
model="gpt-4",
num_examples=200 # 100 from each task
)
# Results include metrics from both environments
print(f"Overall accuracy: {results['metadata']['avg_reward']}")
print(f"Total examples: {results['metadata']['num_examples']}")
Different Environment Types
import verifiers as vf
def calculator_tool(expression: str) -> float:
"""Evaluate math expression."""
return eval(expression)
def load_environment():
# Single-turn QA
qa_env = vf.SingleTurnEnv(
dataset=qa_dataset,
rubric=vf.Rubric(qa_reward),
)
# Tool-using environment
math_env = vf.ToolEnv(
tools=[calculator_tool],
dataset=math_dataset,
rubric=vf.Rubric(math_reward),
max_turns=5
)
# Multi-turn game
game_env = MyGameEnv(
dataset=game_dataset,
rubric=vf.Rubric(game_reward),
max_turns=20
)
return vf.EnvGroup(
envs=[qa_env, math_env, game_env],
env_names=["qa", "math", "game"]
)
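Note that the eval-based calculator_tool above executes arbitrary Python and is only suitable for trusted examples. A restricted arithmetic-only variant using the ast module is one safer sketch (an illustrative replacement, not part of verifiers):

```python
# Arithmetic-only expression evaluator: walks the AST and permits
# only numeric constants and basic operators, unlike bare eval().
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_calculator(expression: str) -> float:
    """Evaluate a basic arithmetic expression without eval."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)

print(safe_calculator("2 * (3 + 4)"))  # → 14
```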
With Custom Reward Functions
import verifiers as vf
def load_environment():
# Environment A: Correctness only
env_a = vf.SingleTurnEnv(
dataset=dataset_a,
rubric=vf.Rubric(
lambda answer, completion: 1.0 if answer in str(completion) else 0.0
)
)
# Environment B: Correctness + length penalty
def correctness_b(answer, completion):
return 1.0 if answer in str(completion) else 0.0
def length_b(completion):
return len(str(completion))
env_b = vf.SingleTurnEnv(
dataset=dataset_b,
rubric=vf.Rubric(correctness_b, length_b)
)
group = vf.EnvGroup(
envs=[env_a, env_b],
env_names=["simple", "complex"]
)
# All outputs will have both metrics:
# - correctness_b (0.0 for env_a examples)
# - length_b (0.0 for env_a examples)
return group
# Metrics in results
results = await env.evaluate(...)
for output in results["outputs"]:
print(f"Task: {output['task']}")
print(f"Metrics: {output['metrics']}")
# All outputs have all metric names, even if 0.0
Shared Configuration
import verifiers as vf
def load_environment():
env_a = vf.SingleTurnEnv(dataset=dataset_a, rubric=rubric_a)
env_b = vf.SingleTurnEnv(dataset=dataset_b, rubric=rubric_b)
env_c = vf.SingleTurnEnv(dataset=dataset_c, rubric=rubric_c)
group = vf.EnvGroup(
envs=[env_a, env_b, env_c],
env_names=["a", "b", "c"]
)
# Set max_seq_len for all environments
group.set_max_seq_len(2048)
# Disable scoring for all environments
group.set_score_rollouts(False)
return group
Weighted Sampling (Manual)
import verifiers as vf
from datasets import concatenate_datasets
def load_environment():
env_a = vf.SingleTurnEnv(dataset=dataset_a, rubric=rubric_a)
env_b = vf.SingleTurnEnv(dataset=dataset_b, rubric=rubric_b)
# Manually control dataset sizes before creating group
dataset_a_repeated = concatenate_datasets([dataset_a] * 3) # 3x weight
dataset_b_repeated = dataset_b # 1x weight
env_a_weighted = vf.SingleTurnEnv(dataset=dataset_a_repeated, rubric=rubric_a)
env_b_weighted = vf.SingleTurnEnv(dataset=dataset_b_repeated, rubric=rubric_b)
return vf.EnvGroup(
envs=[env_a_weighted, env_b_weighted],
env_names=["a", "b"]
)
Dataset Builders with EnvGroup
import verifiers as vf
from datasets import load_dataset
def load_environment():
# Use DatasetBuilder pattern for lazy loading
def build_qa_dataset():
return load_dataset("squad", split="train")
def build_math_dataset():
return load_dataset("gsm8k", "main", split="train")
qa_env = vf.SingleTurnEnv(
dataset=build_qa_dataset, # Callable
rubric=vf.Rubric(qa_reward)
)
math_env = vf.SingleTurnEnv(
dataset=build_math_dataset, # Callable
rubric=vf.Rubric(math_reward)
)
# EnvGroup will trigger dataset building when needed
return vf.EnvGroup(
envs=[qa_env, math_env],
env_names=["qa", "math"]
)
Built-in Rubric
EnvGroup includes EnvGroupRubric which:
- Routes scoring to the appropriate environment’s rubric based on task
- Aggregates all reward function names across all environments
- Ensures all states have all metric names (0.0 for missing metrics)
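The zero-fill behavior can be sketched as taking the union of metric names across environments and padding each environment's scores (an illustration of the documented behavior, not EnvGroupRubric's actual code):

```python
# Pad one environment's scores with 0.0 for metrics it doesn't define,
# so every state carries the full set of metric names.
def zero_fill(scores: dict, all_metric_names: set) -> dict:
    return {name: scores.get(name, 0.0) for name in all_metric_names}

env_metrics = {
    "qa": {"correctness": 1.0},
    "math": {"correctness": 0.0, "length": 42.0},
}
all_names = set().union(*(m.keys() for m in env_metrics.values()))
print(zero_fill(env_metrics["qa"], all_names))
# qa's state now includes "length" with value 0.0
```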
Common Patterns
Task Distribution Analysis
results = await env.evaluate(...)
# Count examples per task
from collections import Counter
task_counts = Counter(output["task"] for output in results["outputs"])
print(f"Task distribution: {task_counts}")
# Accuracy per task
from collections import defaultdict
task_rewards = defaultdict(list)
for output in results["outputs"]:
task_rewards[output["task"]].append(output["reward"])
for task, rewards in task_rewards.items():
avg_reward = sum(rewards) / len(rewards)
print(f"{task}: {avg_reward:.2%} ({len(rewards)} examples)")
Filter by Task
results = await env.evaluate(...)
# Get only math task results
math_outputs = [o for o in results["outputs"] if o["task"] == "math"]
# Compute task-specific metrics
math_reward = sum(o["reward"] for o in math_outputs) / len(math_outputs)
print(f"Math accuracy: {math_reward:.2%}")
Dynamic Environment Creation
import verifiers as vf
def create_task_env(task_name: str, dataset, reward_fn):
return vf.SingleTurnEnv(
dataset=dataset,
rubric=vf.Rubric(reward_fn),
system_prompt=f"Solve {task_name} tasks."
)
def load_environment():
tasks = [
("qa", qa_dataset, qa_reward),
("math", math_dataset, math_reward),
("code", code_dataset, code_reward),
]
envs = [create_task_env(name, ds, reward) for name, ds, reward in tasks]
env_names = [name for name, _, _ in tasks]
return vf.EnvGroup(envs=envs, env_names=env_names)
When to Use
Use EnvGroup for:
- Multi-task training
- Curriculum learning with different task types
- Combining benchmarks into a single evaluation
- Training generalist models across diverse tasks
Avoid EnvGroup if:
- You only have one task
- Tasks require completely different model architectures
- You want to train separate models per task
See Also