Verifiers environments are designed for reinforcement learning training. This guide covers training with Hosted Training (recommended), the open-source prime-rl trainer, and prompt optimization with GEPA.

Training Options

Three primary approaches:
Method | Best For | Infrastructure
Hosted Training | Production training, no GPU management | Managed by Prime Intellect
prime-rl | Self-hosted, large-scale training | Your GPU cluster
GEPA | Prompt optimization (no gradient training) | CPU/single GPU

Hosted Training

Hosted Training provides fully managed RL training infrastructure. You provide an environment and config, we handle the rest.

Getting Started

1. Setup workspace

   prime lab setup

   This downloads example configs to configs/rl/.

2. Choose a base config

   Example configs:
   • gsm8k.toml - Math reasoning
   • math-python.toml - Code-based math
   • wordle.toml - Game playing
   • wiki-search.toml - Tool use

3. Configure your training run

   Edit or create a config:

   model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
   max_steps = 500
   batch_size = 256
   rollouts_per_example = 8

   [sampling]
   max_tokens = 512

   [[env]]
   id = "my-environment"
   args = { difficulty = "medium" }

   [wandb]
   project = "my-project"
   name = "my-run"

4. Submit training job

   Submit via the Prime Lab UI or CLI:

   prime train submit configs/rl/my-training.toml
    

    Configuration Reference

    # Model and training
    model = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # Base model
    max_steps = 500                             # Training steps
    batch_size = 256                            # Samples per gradient update
    rollouts_per_example = 8                    # Rollouts per example for advantage
    
    # Sampling parameters
    [sampling]
    max_tokens = 512
    temperature = 1.0
    
    # Environment configuration
    [[env]]
    id = "primeintellect/alphabet-sort"
    args = { min_turns = 3, max_turns = 5 }
    
    # W&B logging
    [wandb]
    project = "alphabet-sort"
    name = "qwen3-30b-alphabet-sort"
    
    # Optional: environment variables for API keys
    env_file = ["secrets.env"]
    
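The rollouts_per_example setting controls the group size used to turn raw rewards into advantages. As a minimal sketch of group-relative advantage normalization (GRPO-style; the trainer's exact estimator may differ), each example's rollout rewards are normalized against their own group mean and standard deviation:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one example's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same: this group carries no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# 8 rollouts of a single example (rollouts_per_example = 8)
advantages = group_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])
```

Larger groups give lower-variance advantage estimates at the cost of more inference per gradient step.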

    Supported Models

    Hosted Training currently supports:
    • Qwen/Qwen3-4B-Instruct-2507
    • Qwen/Qwen3-4B-Thinking-2507
    • Qwen/Qwen3-30B-Instruct-2507
    • Qwen/Qwen3-30B-Thinking-2507
    • Qwen/Qwen3-235B-Instruct-2507
    • Qwen/Qwen3-235B-Thinking-2507
    • PrimeIntellect/INTELLECT-3
    Hosted Training is currently in Private Beta. Request access.

    Environment Variables

    For environments requiring API keys (e.g., judge models):
1. Create a secrets file:

   secrets.env
   OPENAI_API_KEY=sk-...
   ANTHROPIC_API_KEY=sk-ant-...

2. Reference it in your config:

   env_file = ["secrets.env"]

3. Or set the variables via the Lab UI when submitting the job
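The secrets file uses plain KEY=VALUE lines. When testing an environment locally (outside the hosted runner, which handles this for you), such a file can be parsed with a few lines of Python; this is an illustrative sketch, not platform code:

```python
def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Apply to the current process before loading the environment:
# import os; os.environ.update(load_env_file("secrets.env"))
```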

    prime-rl: Self-Hosted Training

    prime-rl is our production-ready async RL trainer for self-managed GPU infrastructure.

    Setup

1. Clone and install

   prime lab setup --prime-rl

   This:
   • Clones the prime-rl repository
   • Installs dependencies
   • Sets up example configs in configs/prime-rl/

2. Configure training

   Example config:

   # Model and infrastructure
   model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
   num_gpus = 8
   tensor_parallel_size = 4

   # Training parameters
   max_steps = 1000
   batch_size = 512
   rollouts_per_example = 16
   learning_rate = 1e-5

   # Environment
   [env]
   id = "wiki-search"
   args = { max_turns = 10, judge_model = "gpt-4.1-mini" }

   # Sampling
   [sampling]
   max_tokens = 1024
   temperature = 1.0

   # W&B
   [wandb]
   project = "wiki-search"
   name = "qwen3-30b-wiki"

3. Launch training

   uv run prime-rl configs/prime-rl/my-training.toml

   This launches a tmux session with:
   • Inference server (vLLM)
   • Orchestrator (coordinates training)
   • Trainer (updates model weights)

Key Features

    • Async rollout generation: Non-blocking inference for maximum throughput
    • Continuous batching: Efficient GPU utilization
    • In-flight weight updates: Models update during rollout generation
    • Online difficulty filtering: Focus on appropriately challenging examples
    • LoRA support: Efficient fine-tuning for large models
    • MoE support: Mixture-of-Experts architectures
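Online difficulty filtering keeps only examples whose rollout groups show reward variance, i.e. tasks the model sometimes solves and sometimes fails. A minimal sketch of the idea (prime-rl's exact threshold semantics may differ; here difficulty_threshold is treated as a minimum reward standard deviation):

```python
import statistics

def filter_by_difficulty(groups: dict[str, list[float]],
                         threshold: float = 0.3) -> list[str]:
    """Keep example ids whose rollout-reward std exceeds the threshold."""
    return [ex for ex, rewards in groups.items()
            if statistics.pstdev(rewards) > threshold]

groups = {
    "solved":   [1.0] * 8,                                # always right: no signal
    "failed":   [0.0] * 8,                                # always wrong: no signal
    "learning": [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0], # mixed: useful for RL
}
```

Always-solved and always-failed examples produce zero-variance groups and thus zero advantages, so filtering them out spends compute only where gradients are informative.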

    Configuration Options

    # Core training
    max_steps = 1000
    batch_size = 512
    rollouts_per_example = 16
    learning_rate = 1e-5
    
    # LoRA (optional)
    use_lora = true
    lora_rank = 64
    lora_alpha = 128
    
    # Difficulty filtering (optional)
    use_difficulty_filtering = true
    difficulty_threshold = 0.3  # Min reward variance required
    
    # Sampling
    [sampling]
    max_tokens = 1024
    temperature = 1.0
    top_p = 0.9
    
    # Infrastructure
    num_gpus = 8
    tensor_parallel_size = 4
    pipeline_parallel_size = 1
    
    For full documentation: prime-rl docs

    GEPA: Prompt Optimization

    GEPA (Genetic-Pareto) optimizes system prompts without gradient-based training, using a teacher LLM to iteratively improve prompts based on evaluation results.

    Basic Usage

    prime gepa run my-env --model google/gemini-3-flash-preview
    
    This:
    1. Runs initial evaluation with current prompt
    2. Uses teacher model to propose improvements
    3. Evaluates new prompts
    4. Selects best prompts (Pareto frontier)
    5. Repeats until budget exhausted
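Step 4's Pareto selection keeps every prompt that is best on at least one validation example, rather than a single global winner, which preserves prompts with complementary strengths. A minimal sketch, assuming a scores[prompt] list of per-example rewards (names are illustrative, not GEPA's internals):

```python
def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Prompts achieving the max score on at least one validation example."""
    n_examples = len(next(iter(scores.values())))
    frontier: set[str] = set()
    for i in range(n_examples):
        best = max(s[i] for s in scores.values())
        frontier.update(p for p, s in scores.items() if s[i] == best)
    return frontier

scores = {
    "prompt_a": [1.0, 0.0, 0.5],  # best on examples 0 and 2
    "prompt_b": [0.0, 1.0, 0.5],  # best on examples 1 and 2
    "prompt_c": [0.5, 0.5, 0.4],  # never best: dominated, dropped
}
```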

    Configuration

Flag | Description | Default
--model / -m | Model for rollouts | Required
--reflection-model / -M | Teacher model for prompt refinement | Same as --model
--max-calls / -B | Evaluation budget | 500
--num-train / -n | Training examples | 100
--num-val / -N | Validation examples | 50
--minibatch-size | Examples per reflection | 3
--perfect-score | Max reward (skip if achieved) | None
--state-columns | Extra state fields for reflection | None

    Example Workflow

1. Run optimization

   prime gepa run wordle \
     --model google/gemini-3-flash-preview \
     --reflection-model google/gemini-3-exp-ultra-preview \
     --max-calls 1000 \
     --num-train 200 \
     --num-val 100

2. Check output

   Results are saved to environments/wordle/outputs/gepa/:
   • best_prompt.txt - Optimized system prompt
   • pareto_frontier.jsonl - Best prompts per validation example
   • metadata.json - Run configuration and summary

3. Use the optimized prompt

   Copy the best prompt to your environment:

   DEFAULT_SYSTEM_PROMPT = """<content from best_prompt.txt>"""

   def load_environment(
       system_prompt: str = DEFAULT_SYSTEM_PROMPT,
       **kwargs
   ):
       return vf.SingleTurnEnv(
           dataset=dataset,
           system_prompt=system_prompt,
           rubric=rubric,
       )

4. Verify improvement

   # Before optimization
   prime eval run wordle -m google/gemini-3-flash-preview -n 100

   # After optimization
   prime eval run wordle \
     -m google/gemini-3-flash-preview \
     -n 100 \
     -a '{"system_prompt": "<optimized prompt>"}'

    GEPA Configuration Files

    Use TOML configs for reproducible optimization:
    configs/gepa/my-optimization.toml
    env_id = "my-env"
    model = "google/gemini-3-flash-preview"
    reflection_model = "google/gemini-3-exp-ultra-preview"
    max_calls = 1000
    num_train = 200
    num_val = 100
    minibatch_size = 5
    perfect_score = 1.0
    state_columns = ["parsed_answer", "tool_calls"]
    
    Run:
    prime gepa run configs/gepa/my-optimization.toml
    

    RL Best Practices

    Before Training

1. Evaluate baseline performance

   Run an evaluation to establish a baseline:

   prime eval run my-env -m base-model -n 100 -r 5

   Target baselines:
   • Too easy: >80% success → task may be too simple
   • Good range: 10-70% success → ideal for RL
   • Too hard: <5% success → model may need a stronger base

2. Check reward diversity

   Ensure varied rewards within groups:

   prime eval run my-env -m base-model -n 20 -r 8 -s

   Analyze results:

   import json
   import numpy as np

   with open("results.jsonl") as f:
       rollouts = [json.loads(line) for line in f]

   # Group rewards by example
   examples = {}
   for r in rollouts:
       ex = r["example_id"]
       if ex not in examples:
           examples[ex] = []
       examples[ex].append(r["reward"])

   # Check per-group variance
   for ex, rewards in examples.items():
       print(f"Example {ex}: std={np.std(rewards):.3f}, rewards={rewards}")

   Low variance within groups indicates the rewards may need tuning.

3. Verify environment correctness

   prime eval run my-env -m gpt-4.1-mini -n 5 -v

   Manually inspect that:
   • Reward functions give expected scores
   • Stop conditions trigger correctly
   • Tool calls execute properly
   • Error handling works

Training Hyperparameters

    For More Aggressive Training

⚠️ Higher risk of instability/collapse:
• Increase learning rate: 1e-5 → 1e-4 (LoRA), 1e-6 → 1e-5 (full)
• Decrease rollouts_per_example: 16 → 8
• Decrease batch_size: 512 → 256

    For More Stable Training

    ✅ Slower progress but safer:
• Increase rollouts_per_example: 8 → 16 or 32
• Increase batch_size: 256 → 512 or 1024
• Use larger models: 4B → 30B or 235B
    • Enable difficulty filtering (prime-rl)

    During Training

    Monitor W&B metrics:
    • reward/mean - Should increase steadily
    • reward/std - Should remain stable (not collapse to 0)
    • policy/entropy - Should decrease but not collapse
    • policy/kl - Should stay within bounds
    Watch for instability:
    • Sudden reward drops
    • Loss divergence
    • Degenerate outputs (repetition, incoherence)
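Sudden reward drops can also be caught automatically by comparing recent reward means against the preceding window. A simple sketch (window size and tolerance are illustrative, not trainer defaults):

```python
def sudden_drop(reward_history: list[float], window: int = 10,
                tolerance: float = 0.5) -> bool:
    """True if the latest window's mean reward fell below
    tolerance x the previous window's mean."""
    if len(reward_history) < 2 * window:
        return False  # not enough history to compare
    prev = sum(reward_history[-2 * window:-window]) / window
    curr = sum(reward_history[-window:]) / window
    return prev > 0 and curr < tolerance * prev
```

Wiring a check like this into your logging loop lets you halt and roll back to a checkpoint before a collapse wastes significant compute.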
    Checkpoint frequently:
    [checkpointing]
    save_every_n_steps = 50
    keep_n_checkpoints = 10
    

    Common Issues

    OOM During Generation

    • Reduce rollouts_per_example
    • Reduce batch_size
    • Use LoRA instead of full finetuning
    • Increase tensor_parallel_size

    Training Instability

    • Decrease learning rate
    • Increase rollouts_per_example (better advantage estimates)
    • Increase batch_size (more stable gradients)
    • Enable gradient clipping
    • Use reward clipping/normalization

    Slow Training

    • Increase learning rate (if stable)
    • Use continuous rewards instead of binary
    • Enable online difficulty filtering
    • Use appropriate task difficulty
    • Check GPU utilization

    Model Collapse

Symptoms: all outputs become identical, entropy → 0.
Fixes:
    • Restart from earlier checkpoint
    • Decrease learning rate
    • Increase KL penalty
    • Increase entropy bonus
    • Increase rollout diversity (temperature, top_p)
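Entropy collapse can be monitored directly from the log-probabilities of sampled tokens. A minimal sketch, assuming you can access per-token logprobs from rollouts (function names and the floor value are illustrative):

```python
def mean_token_entropy(token_logprobs: list[float]) -> float:
    """Approximate policy entropy as the mean negative logprob
    of the tokens actually sampled."""
    return -sum(token_logprobs) / len(token_logprobs)

def is_collapsing(entropy_history: list[float], floor: float = 0.05) -> bool:
    """Flag collapse when recent mean entropy falls below a small floor."""
    recent = entropy_history[-10:]
    return sum(recent) / len(recent) < floor

healthy = [1.2, 1.1, 1.0, 0.9, 0.9]        # entropy decreasing but healthy
collapsed = [0.04, 0.03, 0.02, 0.01, 0.01]  # near-deterministic outputs
```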

    Advanced Topics

    Multi-Task Training

    Train on multiple environments:
    [[env]]
    id = "gsm8k"
    weight = 1.0
    
    [[env]]
    id = "math-python"
    weight = 1.0
    
    [[env]]
    id = "reverse-text"
    weight = 0.5
    
    Or use EnvGroup in your environment:
    import verifiers as vf
    
    def load_environment():
        math_env = load_math_env()
        code_env = load_code_env()
        reasoning_env = load_reasoning_env()
        
        return vf.EnvGroup(
            envs=[math_env, code_env, reasoning_env],
            env_names=["math", "code", "reasoning"],
        )
    

    Curriculum Learning

    Progressively increase difficulty:
    def load_environment(difficulty_level: int = 1):
        if difficulty_level == 1:
            dataset = easy_dataset
        elif difficulty_level == 2:
            dataset = medium_dataset
        else:
            dataset = hard_dataset
        
        return vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
    
    Update config between training runs:
    # Start easy
    prime train submit configs/rl/my-env-level1.toml
    
    # Progress to medium
    prime train submit configs/rl/my-env-level2.toml
    
    # Final hard
    prime train submit configs/rl/my-env-level3.toml
    

    Continuous Rewards

    Prefer continuous over binary rewards:
    # Binary (less informative)
    async def binary_reward(completion, answer) -> float:
        return 1.0 if exact_match(completion, answer) else 0.0
    
    # Continuous (more informative)
    async def continuous_reward(completion, answer) -> float:
        from difflib import SequenceMatcher
        response = parser.parse_answer(completion)
        return SequenceMatcher(None, response, answer).ratio()
    
    Continuous rewards provide better gradient signal.
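To see the difference in signal, here is a standalone synchronous comparison mirroring the snippet above (exact_match and the parser are replaced by direct string comparison for illustration):

```python
from difflib import SequenceMatcher

def binary_reward(response: str, answer: str) -> float:
    return 1.0 if response == answer else 0.0

def continuous_reward(response: str, answer: str) -> float:
    return SequenceMatcher(None, response, answer).ratio()

# A near-miss gets zero binary reward but substantial continuous reward:
print(binary_reward("3.14", "3.1415"))      # 0.0
print(continuous_reward("3.14", "3.1415"))  # 0.8
```

The near-miss earns partial credit under the continuous reward, so the policy gradient distinguishes "close" from "completely wrong" instead of treating both identically.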

    Chat Template Issues

Non-Increasing Chat Templates: Some models (Qwen3, DeepSeek-R1) remove <think> sections when processing multi-turn conversations, violating the increasing-context requirement for RL. Use modified versions with fixed templates: Modified Models

    Other Trainers

    Verifiers environments work with multiple training frameworks:

    Tinker

    Tinker supports Verifiers via recipes:
    git clone https://github.com/thinking-machines-lab/tinker-cookbook
    cd tinker-cookbook/recipes/verifiers_rl
    # Follow setup instructions
    

    SkyRL

    SkyRL integrates Verifiers:
    git clone https://github.com/NovaSky-AI/SkyRL
    cd SkyRL/skyrl-train/integrations/verifiers
    # Follow setup instructions
    

    rLLM

    rLLM supports both verl and Tinker backends:
    pip install rllm
    # See documentation: https://rllm-project.readthedocs.io/en/latest/examples/verifiers/
    

    Next Steps

    • Evaluation: Monitor training progress with evaluations → Evaluation Guide
    • Environment improvements: Iterate on reward functions and task design
    • Scaling: Move from small experiments to full training runs
    • Model selection: Experiment with different base models
