EnvGroup
Environment that combines multiple environments into a single mixture, routing rollouts to the appropriate environment based on the task field.
Overview
EnvGroup enables:
- Training on multiple tasks: Combine different environments into one training dataset
- Task-based routing: Each rollout is routed to the correct environment based on the task field
- Unified metrics: Aggregate metrics across all environments
- Shared configuration: Apply settings to all sub-environments at once
Inheritance
EnvGroup extends Environment.
Constructor
EnvGroup(
envs: list[vf.Environment],
env_names: list[str] | None = None,
map_kwargs: dict = {},
**kwargs
)
Parameters
envs
list[vf.Environment]
required
List of environment instances to combine. Must contain at least one environment.
env_names
list[str] | None
Optional names for each environment, used for task routing. If not provided, uses "env_0", "env_1", etc.
map_kwargs
dict
Keyword arguments passed to HuggingFace dataset .map() operations.
All other parameters are inherited from Environment.
Behavior
Dataset Concatenation
EnvGroup concatenates the datasets from all sub-environments:
- Automatically builds datasets from each environment
- Overrides the task column to use env_names for routing
- Ensures unique example_id across all examples
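The concatenation behavior can be sketched with plain lists of dicts (an illustration of the documented behavior, not verifiers' actual implementation, which operates on HuggingFace datasets):

```python
# Sketch: override each row's task with its env_name and assign a
# unique example_id across the combined dataset.
def combine(datasets, env_names):
    combined = []
    next_id = 0
    for rows, name in zip(datasets, env_names):
        for row in rows:
            new_row = dict(row)
            new_row["task"] = name           # overridden for routing
            new_row["example_id"] = next_id  # unique across all examples
            next_id += 1
            combined.append(new_row)
    return combined

qa = [{"question": "Q1"}, {"question": "Q2"}]
math = [{"question": "M1"}]
rows = combine([qa, math], ["qa", "math"])
print([(r["example_id"], r["task"]) for r in rows])
# → [(0, 'qa'), (1, 'qa'), (2, 'math')]
```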
Task Routing
Each rollout is routed based on the task field in the input:
env = EnvGroup(
envs=[env_a, env_b],
env_names=["task_a", "task_b"]
)
# Input with task="task_a" → routed to env_a
# Input with task="task_b" → routed to env_b
Metric Aggregation
All environments’ reward functions are tracked:
- If an environment doesn’t have a metric, it gets 0.0 for that metric
- All states include all metric names across all environments
- Enables fair comparison across different task types
Core Methods
rollout
async def rollout(
input: RolloutInput,
client: Client,
model: str,
sampling_args: SamplingArgs | None = None
) -> vf.State
Routes to the appropriate environment based on input["task"].
get_env_for_task
def get_env_for_task(task: str) -> vf.Environment
Get the environment instance for a given task name.
task (str): Task identifier from the dataset.
Returns: vf.Environment - the environment registered for that task, or the first environment if the task is not found.
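The routing and fallback behavior can be sketched as a name-to-environment lookup (a minimal stand-in, not verifiers' code; the string environments below are placeholders for vf.Environment instances):

```python
# Sketch of get_env_for_task: known tasks route by name, unknown
# tasks fall back to the first environment.
def get_env_for_task(task, envs, env_names):
    mapping = dict(zip(env_names, envs))
    return mapping.get(task, envs[0])  # unknown task → first environment

envs = ["env_a", "env_b"]        # placeholders for environment instances
names = ["task_a", "task_b"]
print(get_env_for_task("task_b", envs, names))   # → env_b
print(get_env_for_task("unknown", envs, names))  # → env_a (fallback)
```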
set_max_seq_len
def set_max_seq_len(max_seq_len: int | None) -> None
Set max sequence length for the group and all sub-environments.
set_score_rollouts
def set_score_rollouts(score_rollouts: bool) -> None
Set score_rollouts flag for the group and all sub-environments.
Example Usage
Combining Q&A and Math
import verifiers as vf
from datasets import load_dataset
def load_environment():
# QA environment
qa_dataset = load_dataset("squad", split="train[:100]")
qa_env = vf.SingleTurnEnv(
dataset=qa_dataset,
rubric=vf.Rubric(lambda answer, completion: 1.0 if answer in str(completion) else 0.0),
system_prompt="Answer the question based on the context."
)
# Math environment
math_dataset = load_dataset("gsm8k", "main", split="train[:100]")
math_env = vf.SingleTurnEnv(
dataset=math_dataset,
rubric=vf.Rubric(lambda answer, completion: 1.0 if answer in str(completion) else 0.0),
system_prompt="Solve the math problem."
)
# Combine into mixture
return vf.EnvGroup(
envs=[qa_env, math_env],
env_names=["qa", "math"]
)
# Usage
env = load_environment()
results = await env.evaluate(
client=vf.ClientConfig(provider="openai", api_key="sk-..."),
model="gpt-4",
num_examples=200 # 100 from each task
)
# Results include metrics from both environments
print(f"Overall accuracy: {results['metadata']['avg_reward']}")
print(f"Total examples: {results['metadata']['num_examples']}")
Different Environment Types
import verifiers as vf
def calculator_tool(expression: str) -> float:
"""Evaluate math expression."""
return eval(expression)
def load_environment():
# Single-turn QA
qa_env = vf.SingleTurnEnv(
dataset=qa_dataset,
rubric=vf.Rubric(qa_reward),
)
# Tool-using environment
math_env = vf.ToolEnv(
tools=[calculator_tool],
dataset=math_dataset,
rubric=vf.Rubric(math_reward),
max_turns=5
)
# Multi-turn game
game_env = MyGameEnv(
dataset=game_dataset,
rubric=vf.Rubric(game_reward),
max_turns=20
)
return vf.EnvGroup(
envs=[qa_env, math_env, game_env],
env_names=["qa", "math", "game"]
)
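Note that the eval-based calculator_tool above executes arbitrary Python and is only suitable for trusted examples. A restricted arithmetic-only variant using the ast module is one safer sketch (an illustrative replacement, not part of verifiers):

```python
# Arithmetic-only expression evaluator: walks the AST and permits
# only numeric constants and basic operators, unlike bare eval().
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_calculator(expression: str) -> float:
    """Evaluate a basic arithmetic expression without eval."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)

print(safe_calculator("2 * (3 + 4)"))  # → 14
```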
With Custom Reward Functions
import verifiers as vf
def load_environment():
# Environment A: Correctness only
env_a = vf.SingleTurnEnv(
dataset=dataset_a,
rubric=vf.Rubric(
lambda answer, completion: 1.0 if answer in str(completion) else 0.0
)
)
# Environment B: Correctness + length penalty
def correctness_b(answer, completion):
return 1.0 if answer in str(completion) else 0.0
def length_b(completion):
return len(str(completion))
env_b = vf.SingleTurnEnv(
dataset=dataset_b,
rubric=vf.Rubric(correctness_b, length_b)
)
group = vf.EnvGroup(
envs=[env_a, env_b],
env_names=["simple", "complex"]
)
# All outputs will have both metrics:
# - correctness_b (0.0 for env_a examples)
# - length_b (0.0 for env_a examples)
return group
# Metrics in results
results = await env.evaluate(...)
for output in results["outputs"]:
print(f"Task: {output['task']}")
print(f"Metrics: {output['metrics']}")
# All outputs have all metric names, even if 0.0
Shared Configuration
import verifiers as vf
def load_environment():
env_a = vf.SingleTurnEnv(dataset=dataset_a, rubric=rubric_a)
env_b = vf.SingleTurnEnv(dataset=dataset_b, rubric=rubric_b)
env_c = vf.SingleTurnEnv(dataset=dataset_c, rubric=rubric_c)
group = vf.EnvGroup(
envs=[env_a, env_b, env_c],
env_names=["a", "b", "c"]
)
# Set max_seq_len for all environments
group.set_max_seq_len(2048)
# Disable scoring for all environments
group.set_score_rollouts(False)
return group
Weighted Sampling (Manual)
import verifiers as vf
from datasets import concatenate_datasets
def load_environment():
env_a = vf.SingleTurnEnv(dataset=dataset_a, rubric=rubric_a)
env_b = vf.SingleTurnEnv(dataset=dataset_b, rubric=rubric_b)
# Manually control dataset sizes before creating group
dataset_a_repeated = concatenate_datasets([dataset_a] * 3) # 3x weight
dataset_b_repeated = dataset_b # 1x weight
env_a_weighted = vf.SingleTurnEnv(dataset=dataset_a_repeated, rubric=rubric_a)
env_b_weighted = vf.SingleTurnEnv(dataset=dataset_b_repeated, rubric=rubric_b)
return vf.EnvGroup(
envs=[env_a_weighted, env_b_weighted],
env_names=["a", "b"]
)
Dataset Builders with EnvGroup
import verifiers as vf
from datasets import load_dataset
def load_environment():
# Use DatasetBuilder pattern for lazy loading
def build_qa_dataset():
return load_dataset("squad", split="train")
def build_math_dataset():
return load_dataset("gsm8k", "main", split="train")
qa_env = vf.SingleTurnEnv(
dataset=build_qa_dataset, # Callable
rubric=vf.Rubric(qa_reward)
)
math_env = vf.SingleTurnEnv(
dataset=build_math_dataset, # Callable
rubric=vf.Rubric(math_reward)
)
# EnvGroup will trigger dataset building when needed
return vf.EnvGroup(
envs=[qa_env, math_env],
env_names=["qa", "math"]
)
Built-in Rubric
EnvGroup includes EnvGroupRubric which:
- Routes scoring to the appropriate environment’s rubric based on task
- Aggregates all reward function names across all environments
- Ensures all states have all metric names (0.0 for missing metrics)
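The zero-fill behavior can be sketched as taking the union of metric names across environments and padding each environment's scores (an illustration of the documented behavior, not EnvGroupRubric's actual code):

```python
# Pad one environment's scores with 0.0 for metrics it doesn't define,
# so every state carries the full set of metric names.
def zero_fill(scores: dict, all_metric_names: set) -> dict:
    return {name: scores.get(name, 0.0) for name in all_metric_names}

env_metrics = {
    "qa": {"correctness": 1.0},
    "math": {"correctness": 0.0, "length": 42.0},
}
all_names = set().union(*(m.keys() for m in env_metrics.values()))
print(zero_fill(env_metrics["qa"], all_names))
# qa's state now includes "length" with value 0.0
```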
Common Patterns
Task Distribution Analysis
results = await env.evaluate(...)
# Count examples per task
from collections import Counter
task_counts = Counter(output["task"] for output in results["outputs"])
print(f"Task distribution: {task_counts}")
# Accuracy per task
from collections import defaultdict
task_rewards = defaultdict(list)
for output in results["outputs"]:
task_rewards[output["task"]].append(output["reward"])
for task, rewards in task_rewards.items():
avg_reward = sum(rewards) / len(rewards)
print(f"{task}: {avg_reward:.2%} ({len(rewards)} examples)")
Filter by Task
results = await env.evaluate(...)
# Get only math task results
math_outputs = [o for o in results["outputs"] if o["task"] == "math"]
# Compute task-specific metrics
math_reward = sum(o["reward"] for o in math_outputs) / len(math_outputs)
print(f"Math accuracy: {math_reward:.2%}")
Dynamic Environment Creation
import verifiers as vf
def create_task_env(task_name: str, dataset, reward_fn):
return vf.SingleTurnEnv(
dataset=dataset,
rubric=vf.Rubric(reward_fn),
system_prompt=f"Solve {task_name} tasks."
)
def load_environment():
tasks = [
("qa", qa_dataset, qa_reward),
("math", math_dataset, math_reward),
("code", code_dataset, code_reward),
]
envs = [create_task_env(name, ds, reward) for name, ds, reward in tasks]
env_names = [name for name, _, _ in tasks]
return vf.EnvGroup(envs=envs, env_names=env_names)
When to Use
Use EnvGroup for:
- Multi-task training
- Curriculum learning with different task types
- Combining benchmarks into a single evaluation
- Training generalist models across diverse tasks
Avoid EnvGroup if:
- You only have one task
- Tasks require completely different model architectures
- You want to train separate models per task
See Also