Overview
SingleTurnEnv is a specialized version of MultiTurnEnv with `max_turns=1`. Each rollout follows this simple pattern:
- Send the prompt to the model
- Receive a single response (the completion)
- Score the response using reward functions
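Conceptually, the three steps above reduce to one model call plus scoring. The sketch below is an illustration only, not the library's internals; `single_turn_rollout`, `fake_model`, and the reward function are stand-ins invented for this example.

```python
# Illustrative sketch only: a single-turn rollout is one model call plus scoring.
def single_turn_rollout(model, prompt, reward_funcs, answer):
    completion = [model(prompt)]  # exactly one assistant message
    scores = {f.__name__: f(completion, answer) for f in reward_funcs}
    return completion, scores

def correct_answer(completion, answer):
    return 1.0 if answer in completion[-1]["content"] else 0.0

# toy stand-in "model" that always answers 42
def fake_model(prompt):
    return {"role": "assistant", "content": "The answer is 42."}

completion, scores = single_turn_rollout(
    fake_model,
    [{"role": "user", "content": "What is 6 * 7?"}],
    [correct_answer],
    "42",
)
# scores == {"correct_answer": 1.0}
```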
Your First Environment
Here’s a minimal single-turn environment for math problems. Datasets can be supplied in several formats:
- Direct Prompts: the `prompt` field contains a list of messages ready to send to the model.
- Question Column: `question` strings are wrapped in a user message.
- From Hugging Face: load a dataset from the Hub, as in the full example below.
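The first two row shapes can be sketched with plain dicts (illustrative stand-ins for real dataset rows; `to_messages` is a hypothetical helper, not a library function):

```python
# Illustrative rows for the two dataset formats (plain dicts, not a real
# Hugging Face dataset).
direct_prompt_row = {
    "prompt": [{"role": "user", "content": "What is 6 * 7?"}],
    "answer": "42",
}
question_row = {"question": "What is 6 * 7?", "answer": "42"}

def to_messages(row):
    # a `question` column gets wrapped in a user message, as described above
    if "prompt" in row:
        return row["prompt"]
    return [{"role": "user", "content": row["question"]}]

assert to_messages(direct_prompt_row) == to_messages(question_row)
```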
async def correct_answer(completion, answer) -> float:
    """Check if the answer appears in the response."""
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0
Reward functions can receive these arguments:
- `completion`: model’s output (list of messages)
- `prompt`: input messages
- `answer`: from dataset row
- `info`: structured metadata from dataset
- `state`: full rollout state

import verifiers as vf
from datasets import load_dataset

def load_environment():
    dataset = load_dataset("gsm8k", "main", split="train")

    async def correct_answer(completion, answer) -> float:
        response = completion[-1]["content"]
        return 1.0 if answer in response else 0.0

    rubric = vf.Rubric(funcs=[correct_answer])

    return vf.SingleTurnEnv(
        dataset=dataset,
        system_prompt="You are a helpful math tutor.",
        rubric=rubric,
    )
Real Example: Text Reversal
Let’s examine the `reverse-text` environment from the repository:
environments/reverse_text/reverse_text.py
- Uses `XMLParser` to extract structured output from `<reversed_text>` tags
- Computes a continuous reward based on longest common subsequence
- Allows customization via a `system_prompt` parameter
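Those two pieces can be sketched standalone. This is a simplified stand-in for `XMLParser` plus an LCS-based reward; the actual code in `reverse_text.py` may differ in detail:

```python
import re

# Simplified stand-in for XMLParser: pull the text out of <reversed_text> tags.
def extract_reversed_text(response: str) -> str:
    match = re.search(r"<reversed_text>(.*?)</reversed_text>", response, re.DOTALL)
    return match.group(1).strip() if match else ""

# Continuous reward from longest-common-subsequence (LCS) length.
def lcs_len(a: str, b: str) -> int:
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcs_reward(predicted: str, target: str) -> float:
    if not predicted or not target:
        return 0.0
    return lcs_len(predicted, target) / max(len(predicted), len(target))

response = "<reversed_text>olleh</reversed_text>"
reward = lcs_reward(extract_reversed_text(response), "olleh")
# reward == 1.0 (perfect reversal of "hello")
```

The LCS ratio gives partial credit for near-misses instead of the all-or-nothing score an exact-match check would produce.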
Advanced Patterns
Multiple Reward Functions
Combine multiple scoring criteria with custom weights, e.g.:

reward = 1.0 * check_keywords + 0.1 * length_reward
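In plain Python, that arithmetic looks like this. It is a standalone sketch: `check_keywords` and `length_reward` are illustrative functions, and in verifiers itself you would pass weights to the rubric rather than summing by hand.

```python
# Standalone illustration of weighted reward combination.
def check_keywords(completion, answer):
    return 1.0 if answer in completion[-1]["content"] else 0.0

def length_reward(completion, answer):
    # illustrative criterion: reward concise answers
    return 1.0 if len(completion[-1]["content"]) < 200 else 0.0

def combined_reward(completion, answer, funcs, weights):
    return sum(w * f(completion, answer) for f, w in zip(funcs, weights))

messages = [{"role": "assistant", "content": "The answer is 42."}]
reward = combined_reward(messages, "42", [check_keywords, length_reward], [1.0, 0.1])
# reward == 1.0 * 1.0 + 0.1 * 1.0 == 1.1
```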
Parsing Structured Output
Use parsers to extract specific fields from model responses.

Lazy Dataset Loading

For large datasets, defer loading until first access:
- Avoid loading large datasets during environment initialization
- Better performance when running multiple replicas
- Parameterize dataset creation (splits, shuffling, filtering)
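The deferral itself can be sketched without any library dependencies. `LazyDataset` and `expensive_loader` are illustrative names, not part of the verifiers API:

```python
# Sketch: defer an expensive dataset load until first access.
class LazyDataset:
    def __init__(self, loader):
        self._loader = loader
        self._data = None

    @property
    def data(self):
        if self._data is None:  # load once, on first access
            self._data = self._loader()
        return self._data

calls = []

def expensive_loader():
    calls.append(1)  # count how many times loading actually happens
    return [{"question": "What is 6 * 7?", "answer": "42"}]

dataset = LazyDataset(expensive_loader)  # nothing loaded yet
assert calls == []
_ = dataset.data
_ = dataset.data
assert calls == [1]  # loaded exactly once, then cached
```

Parameters such as split, shuffling, and filtering can be baked into the loader closure, so each replica pays the loading cost only if and when it touches the data.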
Metrics and Observability
Track additional metrics without affecting the reward.

Evaluation Datasets

Provide separate train and evaluation datasets. With `prime eval run`, the evaluation dataset is used automatically.
Common Patterns
Math Verification
Use symbolic math checking with the built-in `MathRubric`.
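`MathRubric` does the real symbolic verification; as a rough standalone stand-in, numeric equivalence can be sketched with `fractions` (an assumption-laden simplification that only handles rationals):

```python
from fractions import Fraction

# Rough stand-in for symbolic checking: "1/2" and "0.5" count as equal.
# (MathRubric itself is far more capable; this handles only rational numbers.)
def math_equal(pred: str, target: str) -> float:
    try:
        return 1.0 if Fraction(pred.strip()) == Fraction(target.strip()) else 0.0
    except (ValueError, ZeroDivisionError):
        # fall back to exact string comparison for non-numeric answers
        return 1.0 if pred.strip() == target.strip() else 0.0

score = math_equal("1/2", "0.5")
# score == 1.0 (equivalent forms)
```

This is why symbolic checking matters: a plain substring check would mark "0.5" wrong against a reference answer of "1/2".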
LLM-as-Judge
Use another LLM to score responses.

Combining Multiple Rubrics

Use `RubricGroup` to combine different scoring approaches.
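The idea can be sketched standalone. Modeling each rubric as a list of (function, weight) pairs is an illustration invented for this sketch, not `RubricGroup`'s actual interface:

```python
# Each "rubric" is modeled here as a list of (function, weight) pairs.
def exact_match(completion, answer):
    return 1.0 if completion.strip() == answer else 0.0

def mentions_units(completion, answer):
    return 1.0 if "km" in completion else 0.0

correctness_rubric = [(exact_match, 1.0)]
style_rubric = [(mentions_units, 0.2)]

def run_group(rubrics, completion, answer):
    scores, total = {}, 0.0
    for rubric in rubrics:
        for func, weight in rubric:
            s = func(completion, answer)
            scores[func.__name__] = s  # per-function scores stay visible
            total += weight * s
    return total, scores

total, scores = run_group([correctness_rubric, style_rubric], "5 km", "5 km")
# total == 1.0 + 0.2 == 1.2
```

Grouping keeps each rubric focused on one concern (correctness, style, format) while still producing a single scalar reward.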
Testing Your Environment
After implementing your environment, run a quick evaluation. Example output:

Loading environment: my-env
Running 10 examples × 3 rollouts = 30 total rollouts
Progress: ████████████████████ 30/30 (100%)
Results:
Reward: 0.73 ± 0.15
correct_answer: 0.73 ± 0.15
response_length: 142.3 ± 45.2
Next Steps
- Multi-turn environments: Add turn-by-turn interaction → Multi-Turn Guide
- Tool use: Give your agent access to tools → Tool Environments Guide
- Training: Use your environment for RL training → Training Guide