Verifiers environments are designed for reinforcement learning training. This guide covers training with Hosted Training (recommended), the open-source prime-rl trainer, and prompt optimization with GEPA.

Training Options

Three primary approaches:
Method | Best For | Infrastructure
Hosted Training | Production training, no GPU management | Managed by Prime Intellect
prime-rl | Self-hosted, large-scale training | Your GPU cluster
GEPA | Prompt optimization (no gradient training) | CPU/single GPU

Hosted Training

Hosted Training provides fully managed RL training infrastructure. You provide an environment and config, we handle the rest.

Getting Started

1. Setup workspace

   prime lab setup

   This downloads example configs to configs/rl/.

2. Choose a base config

   Example configs:
   • gsm8k.toml - Math reasoning
   • math-python.toml - Code-based math
   • wordle.toml - Game playing
   • wiki-search.toml - Tool use

3. Configure your training run

   Edit or create a config:

   model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
   max_steps = 500
   batch_size = 256
   rollouts_per_example = 8

   [sampling]
   max_tokens = 512

   [[env]]
   id = "my-environment"
   args = { difficulty = "medium" }

   [wandb]
   project = "my-project"
   name = "my-run"

4. Submit training job

   Submit via the Prime Lab UI or CLI:

   prime train submit configs/rl/my-training.toml
    

    Configuration Reference

    # Model and training
    model = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # Base model
    max_steps = 500                             # Training steps
    batch_size = 256                            # Samples per gradient update
    rollouts_per_example = 8                    # Rollouts per example for advantage
    
    # Sampling parameters
    [sampling]
    max_tokens = 512
    temperature = 1.0
    
    # Environment configuration
    [[env]]
    id = "primeintellect/alphabet-sort"
    args = { min_turns = 3, max_turns = 5 }
    
    # W&B logging
    [wandb]
    project = "alphabet-sort"
    name = "qwen3-30b-alphabet-sort"
    
    # Optional: environment variables for API keys
    env_file = ["secrets.env"]
    
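The rollouts_per_example setting controls the group size used to turn raw rewards into advantages. As a minimal sketch of group-relative advantage normalization (GRPO-style; the trainer's exact estimator may differ), each example's rollout rewards are normalized against their own group mean and standard deviation:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one example's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same: this group carries no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# 8 rollouts of a single example (rollouts_per_example = 8)
advantages = group_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])
```

Larger groups give lower-variance advantage estimates at the cost of more inference per gradient step.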

    Supported Models

    Hosted Training currently supports:
    • Qwen/Qwen3-4B-Instruct-2507
    • Qwen/Qwen3-4B-Thinking-2507
    • Qwen/Qwen3-30B-Instruct-2507
    • Qwen/Qwen3-30B-Thinking-2507
    • Qwen/Qwen3-235B-Instruct-2507
    • Qwen/Qwen3-235B-Thinking-2507
    • PrimeIntellect/INTELLECT-3
    Hosted Training is currently in Private Beta. Request access.

    Environment Variables

    For environments requiring API keys (e.g., judge models):
1. Create a secrets file:

   secrets.env
   OPENAI_API_KEY=sk-...
   ANTHROPIC_API_KEY=sk-ant-...

2. Reference it in your config:

   env_file = ["secrets.env"]

3. Or set the variables via the Lab UI when submitting the job
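The secrets file uses plain KEY=VALUE lines. When testing an environment locally (outside the hosted runner, which handles this for you), such a file can be parsed with a few lines of Python; this is an illustrative sketch, not platform code:

```python
def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Apply to the current process before loading the environment:
# import os; os.environ.update(load_env_file("secrets.env"))
```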

    prime-rl: Self-Hosted Training

    prime-rl is our production-ready async RL trainer for self-managed GPU infrastructure.

    Setup

1. Clone and install

   prime lab setup --prime-rl

   This:
   • Clones the prime-rl repository
   • Installs dependencies
   • Sets up example configs in configs/prime-rl/

2. Configure training

   Example config:

   # Model and infrastructure
   model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
   num_gpus = 8
   tensor_parallel_size = 4

   # Training parameters
   max_steps = 1000
   batch_size = 512
   rollouts_per_example = 16
   learning_rate = 1e-5

   # Environment
   [env]
   id = "wiki-search"
   args = { max_turns = 10, judge_model = "gpt-4.1-mini" }

   # Sampling
   [sampling]
   max_tokens = 1024
   temperature = 1.0

   # W&B
   [wandb]
   project = "wiki-search"
   name = "qwen3-30b-wiki"

3. Launch training

   uv run prime-rl configs/prime-rl/my-training.toml

   This launches a tmux session with:
   • Inference server (vLLM)
   • Orchestrator (coordinates training)
   • Trainer (updates model weights)

Key Features

    • Async rollout generation: Non-blocking inference for maximum throughput
    • Continuous batching: Efficient GPU utilization
    • In-flight weight updates: Models update during rollout generation
    • Online difficulty filtering: Focus on appropriately challenging examples
    • LoRA support: Efficient fine-tuning for large models
    • MoE support: Mixture-of-Experts architectures
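Online difficulty filtering keeps only examples whose rollout groups show reward variance, i.e. tasks the model sometimes solves and sometimes fails. A minimal sketch of the idea (prime-rl's exact threshold semantics may differ; here difficulty_threshold is treated as a minimum reward standard deviation):

```python
import statistics

def filter_by_difficulty(groups: dict[str, list[float]],
                         threshold: float = 0.3) -> list[str]:
    """Keep example ids whose rollout-reward std exceeds the threshold."""
    return [ex for ex, rewards in groups.items()
            if statistics.pstdev(rewards) > threshold]

groups = {
    "solved":   [1.0] * 8,                                # always right: no signal
    "failed":   [0.0] * 8,                                # always wrong: no signal
    "learning": [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0], # mixed: useful for RL
}
```

Always-solved and always-failed examples produce zero-variance groups and thus zero advantages, so filtering them out spends compute only where gradients are informative.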

    Configuration Options

    # Core training
    max_steps = 1000
    batch_size = 512
    rollouts_per_example = 16
    learning_rate = 1e-5
    
    # LoRA (optional)
    use_lora = true
    lora_rank = 64
    lora_alpha = 128
    
    # Difficulty filtering (optional)
    use_difficulty_filtering = true
    difficulty_threshold = 0.3  # Min reward variance required
    
    # Sampling
    [sampling]
    max_tokens = 1024
    temperature = 1.0
    top_p = 0.9
    
    # Infrastructure
    num_gpus = 8
    tensor_parallel_size = 4
    pipeline_parallel_size = 1
    
    For full documentation: prime-rl docs

    GEPA: Prompt Optimization

    GEPA (Genetic-Pareto) optimizes system prompts without gradient-based training, using a teacher LLM to iteratively improve prompts based on evaluation results.

    Basic Usage

    prime gepa run my-env --model google/gemini-3-flash-preview
    
    This:
    1. Runs initial evaluation with current prompt
    2. Uses teacher model to propose improvements
    3. Evaluates new prompts
    4. Selects best prompts (Pareto frontier)
    5. Repeats until budget exhausted
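Step 4's Pareto selection keeps every prompt that is best on at least one validation example, rather than a single global winner, which preserves prompts with complementary strengths. A minimal sketch, assuming a scores[prompt] list of per-example rewards (names are illustrative, not GEPA's internals):

```python
def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Prompts achieving the max score on at least one validation example."""
    n_examples = len(next(iter(scores.values())))
    frontier: set[str] = set()
    for i in range(n_examples):
        best = max(s[i] for s in scores.values())
        frontier.update(p for p, s in scores.items() if s[i] == best)
    return frontier

scores = {
    "prompt_a": [1.0, 0.0, 0.5],  # best on examples 0 and 2
    "prompt_b": [0.0, 1.0, 0.5],  # best on examples 1 and 2
    "prompt_c": [0.5, 0.5, 0.4],  # never best: dominated, dropped
}
```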

    Configuration

Flag | Description | Default
--model / -m | Model for rollouts | Required
--reflection-model / -M | Teacher model for prompt refinement | Same as --model
--max-calls / -B | Evaluation budget | 500
--num-train / -n | Training examples | 100
--num-val / -N | Validation examples | 50
--minibatch-size | Examples per reflection | 3
--perfect-score | Max reward (skip if achieved) | None
--state-columns | Extra state fields for reflection | None

    Example Workflow

1. Run optimization

   prime gepa run wordle \
     --model google/gemini-3-flash-preview \
     --reflection-model google/gemini-3-exp-ultra-preview \
     --max-calls 1000 \
     --num-train 200 \
     --num-val 100

2. Check output

   Results are saved to environments/wordle/outputs/gepa/:
   • best_prompt.txt - Optimized system prompt
   • pareto_frontier.jsonl - Best prompts per validation example
   • metadata.json - Run configuration and summary

3. Use the optimized prompt

   Copy the best prompt to your environment:

   DEFAULT_SYSTEM_PROMPT = """<content from best_prompt.txt>"""

   def load_environment(
       system_prompt: str = DEFAULT_SYSTEM_PROMPT,
       **kwargs
   ):
       return vf.SingleTurnEnv(
           dataset=dataset,
           system_prompt=system_prompt,
           rubric=rubric,
       )

4. Verify improvement

   # Before optimization
   prime eval run wordle -m google/gemini-3-flash-preview -n 100

   # After optimization
   prime eval run wordle \
     -m google/gemini-3-flash-preview \
     -n 100 \
     -a '{"system_prompt": "<optimized prompt>"}'

    GEPA Configuration Files

    Use TOML configs for reproducible optimization:
    configs/gepa/my-optimization.toml
    env_id = "my-env"
    model = "google/gemini-3-flash-preview"
    reflection_model = "google/gemini-3-exp-ultra-preview"
    max_calls = 1000
    num_train = 200
    num_val = 100
    minibatch_size = 5
    perfect_score = 1.0
    state_columns = ["parsed_answer", "tool_calls"]
    
    Run:
    prime gepa run configs/gepa/my-optimization.toml
    

    RL Best Practices

    Before Training

1. Evaluate baseline performance

   Run an evaluation to establish a baseline:

   prime eval run my-env -m base-model -n 100 -r 5

   Target baselines:
   • Too easy: >80% success → task may be too simple
   • Good range: 10-70% success → ideal for RL
   • Too hard: <5% success → model may need a stronger base

2. Check reward diversity

   Ensure varied rewards within groups:

   prime eval run my-env -m base-model -n 20 -r 8 -s

   Analyze results:

   import json
   import numpy as np

   with open("results.jsonl") as f:
       rollouts = [json.loads(line) for line in f]

   # Group rewards by example
   examples = {}
   for r in rollouts:
       ex = r["example_id"]
       if ex not in examples:
           examples[ex] = []
       examples[ex].append(r["reward"])

   # Check per-group variance
   for ex, rewards in examples.items():
       print(f"Example {ex}: std={np.std(rewards):.3f}, rewards={rewards}")

   Low variance within groups indicates the rewards may need tuning.

3. Verify environment correctness

   prime eval run my-env -m gpt-4.1-mini -n 5 -v

   Manually inspect that:
   • Reward functions give expected scores
   • Stop conditions trigger correctly
   • Tool calls execute properly
   • Error handling works

Training Hyperparameters

    For More Aggressive Training

⚠️ Higher risk of instability/collapse:
• Increase learning rate: 1e-5 → 1e-4 (LoRA), 1e-6 → 1e-5 (full)
• Decrease rollouts_per_example: 16 → 8
• Decrease batch_size: 512 → 256

    For More Stable Training

    ✅ Slower progress but safer:
• Increase rollouts_per_example: 8 → 16 or 32
• Increase batch_size: 256 → 512 or 1024
• Use larger models: 4B → 30B or 235B
    • Enable difficulty filtering (prime-rl)

    During Training

    Monitor W&B metrics:
    • reward/mean - Should increase steadily
    • reward/std - Should remain stable (not collapse to 0)
    • policy/entropy - Should decrease but not collapse
    • policy/kl - Should stay within bounds
    Watch for instability:
    • Sudden reward drops
    • Loss divergence
    • Degenerate outputs (repetition, incoherence)
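Sudden reward drops can also be caught automatically by comparing recent reward means against the preceding window. A simple sketch (window size and tolerance are illustrative, not trainer defaults):

```python
def sudden_drop(reward_history: list[float], window: int = 10,
                tolerance: float = 0.5) -> bool:
    """True if the latest window's mean reward fell below
    tolerance x the previous window's mean."""
    if len(reward_history) < 2 * window:
        return False  # not enough history to compare
    prev = sum(reward_history[-2 * window:-window]) / window
    curr = sum(reward_history[-window:]) / window
    return prev > 0 and curr < tolerance * prev
```

Wiring a check like this into your logging loop lets you halt and roll back to a checkpoint before a collapse wastes significant compute.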
    Checkpoint frequently:
    [checkpointing]
    save_every_n_steps = 50
    keep_n_checkpoints = 10
    

    Common Issues

    OOM During Generation

    • Reduce rollouts_per_example
    • Reduce batch_size
    • Use LoRA instead of full finetuning
    • Increase tensor_parallel_size

    Training Instability

    • Decrease learning rate
    • Increase rollouts_per_example (better advantage estimates)
    • Increase batch_size (more stable gradients)
    • Enable gradient clipping
    • Use reward clipping/normalization

    Slow Training

    • Increase learning rate (if stable)
    • Use continuous rewards instead of binary
    • Enable online difficulty filtering
    • Use appropriate task difficulty
    • Check GPU utilization

    Model Collapse

Symptoms: all outputs become identical, entropy → 0.
Fixes:
    • Restart from earlier checkpoint
    • Decrease learning rate
    • Increase KL penalty
    • Increase entropy bonus
    • Increase rollout diversity (temperature, top_p)
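Entropy collapse can be monitored directly from the log-probabilities of sampled tokens. A minimal sketch, assuming you can access per-token logprobs from rollouts (function names and the floor value are illustrative):

```python
def mean_token_entropy(token_logprobs: list[float]) -> float:
    """Approximate policy entropy as the mean negative logprob
    of the tokens actually sampled."""
    return -sum(token_logprobs) / len(token_logprobs)

def is_collapsing(entropy_history: list[float], floor: float = 0.05) -> bool:
    """Flag collapse when recent mean entropy falls below a small floor."""
    recent = entropy_history[-10:]
    return sum(recent) / len(recent) < floor

healthy = [1.2, 1.1, 1.0, 0.9, 0.9]        # entropy decreasing but healthy
collapsed = [0.04, 0.03, 0.02, 0.01, 0.01]  # near-deterministic outputs
```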

    Advanced Topics

    Multi-Task Training

    Train on multiple environments:
    [[env]]
    id = "gsm8k"
    weight = 1.0
    
    [[env]]
    id = "math-python"
    weight = 1.0
    
    [[env]]
    id = "reverse-text"
    weight = 0.5
    
    Or use EnvGroup in your environment:
    import verifiers as vf
    
    def load_environment():
        math_env = load_math_env()
        code_env = load_code_env()
        reasoning_env = load_reasoning_env()
        
        return vf.EnvGroup(
            envs=[math_env, code_env, reasoning_env],
            env_names=["math", "code", "reasoning"],
        )
    

    Curriculum Learning

    Progressively increase difficulty:
    def load_environment(difficulty_level: int = 1):
        if difficulty_level == 1:
            dataset = easy_dataset
        elif difficulty_level == 2:
            dataset = medium_dataset
        else:
            dataset = hard_dataset
        
        return vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
    
    Update config between training runs:
    # Start easy
    prime train submit configs/rl/my-env-level1.toml
    
    # Progress to medium
    prime train submit configs/rl/my-env-level2.toml
    
    # Final hard
    prime train submit configs/rl/my-env-level3.toml
    

    Continuous Rewards

    Prefer continuous over binary rewards:
    # Binary (less informative)
    async def binary_reward(completion, answer) -> float:
        return 1.0 if exact_match(completion, answer) else 0.0
    
    # Continuous (more informative)
    async def continuous_reward(completion, answer) -> float:
        from difflib import SequenceMatcher
        response = parser.parse_answer(completion)
        return SequenceMatcher(None, response, answer).ratio()
    
    Continuous rewards provide better gradient signal.
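To see the difference in signal, here is a standalone synchronous comparison mirroring the snippet above (exact_match and the parser are replaced by direct string comparison for illustration):

```python
from difflib import SequenceMatcher

def binary_reward(response: str, answer: str) -> float:
    return 1.0 if response == answer else 0.0

def continuous_reward(response: str, answer: str) -> float:
    return SequenceMatcher(None, response, answer).ratio()

# A near-miss gets zero binary reward but substantial continuous reward:
print(binary_reward("3.14", "3.1415"))      # 0.0
print(continuous_reward("3.14", "3.1415"))  # 0.8
```

The near-miss earns partial credit under the continuous reward, so the policy gradient distinguishes "close" from "completely wrong" instead of treating both identically.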

    Chat Template Issues

Non-Increasing Chat Templates: Some models (Qwen3, DeepSeek-R1) remove <think> sections when processing multi-turn conversations, violating the increasing-context requirement for RL. Use modified versions with fixed templates: Modified Models

    Other Trainers

    Verifiers environments work with multiple training frameworks:

    Tinker

    Tinker supports Verifiers via recipes:
    git clone https://github.com/thinking-machines-lab/tinker-cookbook
    cd tinker-cookbook/recipes/verifiers_rl
    # Follow setup instructions
    

    SkyRL

    SkyRL integrates Verifiers:
    git clone https://github.com/NovaSky-AI/SkyRL
    cd SkyRL/skyrl-train/integrations/verifiers
    # Follow setup instructions
    

    rLLM

    rLLM supports both verl and Tinker backends:
    pip install rllm
    # See documentation: https://rllm-project.readthedocs.io/en/latest/examples/verifiers/
    

    Next Steps

    • Evaluation: Monitor training progress with evaluations → Evaluation Guide
    • Environment improvements: Iterate on reward functions and task design
    • Scaling: Move from small experiments to full training runs
    • Model selection: Experiment with different base models
