Overview
SingleTurnEnv is a specialized version of MultiTurnEnv with `max_turns=1`. Each rollout follows this simple pattern:
- Send the prompt to the model
- Receive a single response (the completion)
- Score the response using reward functions
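Conceptually, the three steps above reduce to one model call plus scoring. The sketch below is an illustration only, not the library's internals; `single_turn_rollout`, `fake_model`, and the reward function are stand-ins invented for this example.

```python
# Illustrative sketch only: a single-turn rollout is one model call plus scoring.
def single_turn_rollout(model, prompt, reward_funcs, answer):
    completion = [model(prompt)]  # exactly one assistant message
    scores = {f.__name__: f(completion, answer) for f in reward_funcs}
    return completion, scores

def correct_answer(completion, answer):
    return 1.0 if answer in completion[-1]["content"] else 0.0

# toy stand-in "model" that always answers 42
def fake_model(prompt):
    return {"role": "assistant", "content": "The answer is 42."}

completion, scores = single_turn_rollout(
    fake_model,
    [{"role": "user", "content": "What is 6 * 7?"}],
    [correct_answer],
    "42",
)
# scores == {"correct_answer": 1.0}
```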
Your First Environment
Here’s a minimal single-turn environment for math problems. Datasets can be supplied in several formats:
- Direct Prompts: the `prompt` field contains a list of messages ready to send to the model.
- Question Column: `question` strings are wrapped in a user message.
- From Hugging Face: load a dataset from the Hub, as in the full example below.
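The first two row shapes can be sketched with plain dicts (illustrative stand-ins for real dataset rows; `to_messages` is a hypothetical helper, not a library function):

```python
# Illustrative rows for the two dataset formats (plain dicts, not a real
# Hugging Face dataset).
direct_prompt_row = {
    "prompt": [{"role": "user", "content": "What is 6 * 7?"}],
    "answer": "42",
}
question_row = {"question": "What is 6 * 7?", "answer": "42"}

def to_messages(row):
    # a `question` column gets wrapped in a user message, as described above
    if "prompt" in row:
        return row["prompt"]
    return [{"role": "user", "content": row["question"]}]

assert to_messages(direct_prompt_row) == to_messages(question_row)
```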
async def correct_answer(completion, answer) -> float:
    """Check if the answer appears in the response."""
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0
Reward functions can receive these arguments:
- `completion`: model’s output (list of messages)
- `prompt`: input messages
- `answer`: from dataset row
- `info`: structured metadata from dataset
- `state`: full rollout state

import verifiers as vf
from datasets import load_dataset

def load_environment():
    dataset = load_dataset("gsm8k", "main", split="train")

    async def correct_answer(completion, answer) -> float:
        response = completion[-1]["content"]
        return 1.0 if answer in response else 0.0

    rubric = vf.Rubric(funcs=[correct_answer])

    return vf.SingleTurnEnv(
        dataset=dataset,
        system_prompt="You are a helpful math tutor.",
        rubric=rubric,
    )
Real Example: Text Reversal
Let’s examine the `reverse-text` environment from the repository:
environments/reverse_text/reverse_text.py
- Uses `XMLParser` to extract structured output from `<reversed_text>` tags
- Computes a continuous reward based on longest common subsequence
- Allows customization via a `system_prompt` parameter
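Those two pieces can be sketched standalone. This is a simplified stand-in for `XMLParser` plus an LCS-based reward; the actual code in `reverse_text.py` may differ in detail:

```python
import re

# Simplified stand-in for XMLParser: pull the text out of <reversed_text> tags.
def extract_reversed_text(response: str) -> str:
    match = re.search(r"<reversed_text>(.*?)</reversed_text>", response, re.DOTALL)
    return match.group(1).strip() if match else ""

# Continuous reward from longest-common-subsequence (LCS) length.
def lcs_len(a: str, b: str) -> int:
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcs_reward(predicted: str, target: str) -> float:
    if not predicted or not target:
        return 0.0
    return lcs_len(predicted, target) / max(len(predicted), len(target))

response = "<reversed_text>olleh</reversed_text>"
reward = lcs_reward(extract_reversed_text(response), "olleh")
# reward == 1.0 (perfect reversal of "hello")
```

The LCS ratio gives partial credit for near-misses instead of the all-or-nothing score an exact-match check would produce.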
Advanced Patterns
Multiple Reward Functions
Combine multiple scoring criteria with custom weights, e.g.:

reward = 1.0 * check_keywords + 0.1 * length_reward
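In plain Python, that arithmetic looks like this. It is a standalone sketch: `check_keywords` and `length_reward` are illustrative functions, and in verifiers itself you would pass weights to the rubric rather than summing by hand.

```python
# Standalone illustration of weighted reward combination.
def check_keywords(completion, answer):
    return 1.0 if answer in completion[-1]["content"] else 0.0

def length_reward(completion, answer):
    # illustrative criterion: reward concise answers
    return 1.0 if len(completion[-1]["content"]) < 200 else 0.0

def combined_reward(completion, answer, funcs, weights):
    return sum(w * f(completion, answer) for f, w in zip(funcs, weights))

messages = [{"role": "assistant", "content": "The answer is 42."}]
reward = combined_reward(messages, "42", [check_keywords, length_reward], [1.0, 0.1])
# reward == 1.0 * 1.0 + 0.1 * 1.0 == 1.1
```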
Parsing Structured Output
Use parsers to extract specific fields from model responses.

Lazy Dataset Loading

For large datasets, defer loading until first access:
- Avoid loading large datasets during environment initialization
- Better performance when running multiple replicas
- Parameterize dataset creation (splits, shuffling, filtering)
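The deferral itself can be sketched without any library dependencies. `LazyDataset` and `expensive_loader` are illustrative names, not part of the verifiers API:

```python
# Sketch: defer an expensive dataset load until first access.
class LazyDataset:
    def __init__(self, loader):
        self._loader = loader
        self._data = None

    @property
    def data(self):
        if self._data is None:  # load once, on first access
            self._data = self._loader()
        return self._data

calls = []

def expensive_loader():
    calls.append(1)  # count how many times loading actually happens
    return [{"question": "What is 6 * 7?", "answer": "42"}]

dataset = LazyDataset(expensive_loader)  # nothing loaded yet
assert calls == []
_ = dataset.data
_ = dataset.data
assert calls == [1]  # loaded exactly once, then cached
```

Parameters such as split, shuffling, and filtering can be baked into the loader closure, so each replica pays the loading cost only if and when it touches the data.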
Metrics and Observability
Track additional metrics without affecting the reward.

Evaluation Datasets

Provide separate train and evaluation datasets. With `prime eval run`, the evaluation dataset is used automatically.
Common Patterns
Math Verification
Use symbolic math checking with the built-in `MathRubric`.
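`MathRubric` does the real symbolic verification; as a rough standalone stand-in, numeric equivalence can be sketched with `fractions` (an assumption-laden simplification that only handles rationals):

```python
from fractions import Fraction

# Rough stand-in for symbolic checking: "1/2" and "0.5" count as equal.
# (MathRubric itself is far more capable; this handles only rational numbers.)
def math_equal(pred: str, target: str) -> float:
    try:
        return 1.0 if Fraction(pred.strip()) == Fraction(target.strip()) else 0.0
    except (ValueError, ZeroDivisionError):
        # fall back to exact string comparison for non-numeric answers
        return 1.0 if pred.strip() == target.strip() else 0.0

score = math_equal("1/2", "0.5")
# score == 1.0 (equivalent forms)
```

This is why symbolic checking matters: a plain substring check would mark "0.5" wrong against a reference answer of "1/2".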
LLM-as-Judge
Use another LLM to score responses.

Combining Multiple Rubrics

Use `RubricGroup` to combine different scoring approaches.
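The idea can be sketched standalone. Modeling each rubric as a list of (function, weight) pairs is an illustration invented for this sketch, not `RubricGroup`'s actual interface:

```python
# Each "rubric" is modeled here as a list of (function, weight) pairs.
def exact_match(completion, answer):
    return 1.0 if completion.strip() == answer else 0.0

def mentions_units(completion, answer):
    return 1.0 if "km" in completion else 0.0

correctness_rubric = [(exact_match, 1.0)]
style_rubric = [(mentions_units, 0.2)]

def run_group(rubrics, completion, answer):
    scores, total = {}, 0.0
    for rubric in rubrics:
        for func, weight in rubric:
            s = func(completion, answer)
            scores[func.__name__] = s  # per-function scores stay visible
            total += weight * s
    return total, scores

total, scores = run_group([correctness_rubric, style_rubric], "5 km", "5 km")
# total == 1.0 + 0.2 == 1.2
```

Grouping keeps each rubric focused on one concern (correctness, style, format) while still producing a single scalar reward.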
Testing Your Environment
After implementing your environment, run a quick evaluation. Example output:

Loading environment: my-env
Running 10 examples × 3 rollouts = 30 total rollouts
Progress: ████████████████████ 30/30 (100%)
Results:
Reward: 0.73 ± 0.15
correct_answer: 0.73 ± 0.15
response_length: 142.3 ± 45.2
Next Steps
- Multi-turn environments: Add turn-by-turn interaction → Multi-Turn Guide
- Tool use: Give your agent access to tools → Tool Environments Guide
- Training: Use your environment for RL training → Training Guide