prime-rl trainer, and prompt optimization with GEPA.
Training Options
Three primary approaches:

| Method | Best For | Infrastructure |
|---|---|---|
| Hosted Training | Production training, no GPU management | Managed by Prime Intellect |
| prime-rl | Self-hosted, large-scale training | Your GPU cluster |
| GEPA | Prompt optimization (no gradient training) | CPU/single GPU |
Hosted Training
Hosted Training provides fully managed RL training infrastructure: you provide an environment and a config, and we handle the rest.

Getting Started
Example configs:

- `gsm8k.toml` - Math reasoning
- `math-python.toml` - Code-based math
- `wordle.toml` - Game playing
- `wiki-search.toml` - Tool use

```toml
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 500
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 512

[[env]]
id = "my-environment"
args = { difficulty = "medium" }

[wandb]
project = "my-project"
name = "my-run"
```
Configuration Reference
Supported Models
Hosted Training currently supports:

- `Qwen/Qwen3-4B-Instruct-2507`
- `Qwen/Qwen3-4B-Thinking-2507`
- `Qwen/Qwen3-30B-A3B-Instruct-2507`
- `Qwen/Qwen3-30B-A3B-Thinking-2507`
- `Qwen/Qwen3-235B-A22B-Instruct-2507`
- `Qwen/Qwen3-235B-A22B-Thinking-2507`
- `PrimeIntellect/INTELLECT-3`
Hosted Training is currently in Private Beta. Request access.
Environment Variables
For environments requiring API keys (e.g., judge models):

- Create a secrets file (`secrets.env`)
- Reference it in your config
- Or set variables via the Lab UI when submitting the job
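A secrets file is just `KEY=VALUE` lines; the key names below are hypothetical placeholders (use whatever your judge model or tools expect):

```
# secrets.env (illustrative key names)
OPENAI_API_KEY=sk-...
JUDGE_API_KEY=...
```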
prime-rl: Self-Hosted Training
prime-rl is our production-ready async RL trainer for self-managed GPU infrastructure.
Setup
```toml
# Model and infrastructure
model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
num_gpus = 8
tensor_parallel_size = 4

# Training parameters
max_steps = 1000
batch_size = 512
rollouts_per_example = 16
learning_rate = 1e-5

# Environment
[env]
id = "wiki-search"
args = { max_turns = 10, judge_model = "gpt-4.1-mini" }

# Sampling
[sampling]
max_tokens = 1024
temperature = 1.0

# W&B
[wandb]
project = "wiki-search"
name = "qwen3-30b-wiki"
```
Key Features
- Async rollout generation: Non-blocking inference for maximum throughput
- Continuous batching: Efficient GPU utilization
- In-flight weight updates: Models update during rollout generation
- Online difficulty filtering: Focus on appropriately challenging examples
- LoRA support: Efficient fine-tuning for large models
- MoE support: Mixture-of-Experts architectures
Configuration Options
GEPA: Prompt Optimization
GEPA (Genetic-Pareto) optimizes system prompts without gradient-based training, using a teacher LLM to iteratively improve prompts based on evaluation results.

Basic Usage
- Runs initial evaluation with current prompt
- Uses teacher model to propose improvements
- Evaluates new prompts
- Selects best prompts (Pareto frontier)
- Repeats until budget exhausted
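The loop above can be sketched in a few lines. This is a conceptual illustration, not the prime CLI's implementation: `evaluate` stands in for scoring a prompt on one validation example, and `propose_improvement` stands in for the teacher model's reflection step.

```python
def gepa_sketch(seed_prompt, val_examples, evaluate, propose_improvement, budget=20):
    """Conceptual GEPA loop: propose, evaluate, keep the Pareto frontier."""
    # Frontier maps each candidate prompt to its per-example scores.
    frontier = {seed_prompt: [evaluate(seed_prompt, ex) for ex in val_examples]}
    calls = len(val_examples)
    while calls + len(val_examples) <= budget:
        # Pick a strong parent and ask the teacher for an improved child.
        parent = max(frontier, key=lambda p: sum(frontier[p]))
        child = propose_improvement(parent, frontier[parent])
        frontier[child] = [evaluate(child, ex) for ex in val_examples]
        calls += len(val_examples)
        # Pareto pruning: drop prompts beaten on every validation example.
        frontier = {
            p: s for p, s in frontier.items()
            if not any(all(o[i] >= s[i] for i in range(len(s))) and o != s
                       for o in frontier.values())
        }
    return max(frontier, key=lambda p: sum(frontier[p]))
```

Keeping a per-example Pareto frontier, rather than a single best prompt, preserves prompts that excel on different subsets of the validation set.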
Configuration
| Flag | Description | Default |
|---|---|---|
| `--model` / `-m` | Model for rollouts | Required |
| `--reflection-model` / `-M` | Teacher model for prompt refinement | Same as `--model` |
| `--max-calls` / `-B` | Evaluation budget | 500 |
| `--num-train` / `-n` | Training examples | 100 |
| `--num-val` / `-N` | Validation examples | 50 |
| `--minibatch-size` | Examples per reflection | 3 |
| `--perfect-score` | Max reward (skip if achieved) | None |
| `--state-columns` | Extra state fields for reflection | None |
Example Workflow
```bash
prime gepa run wordle \
  --model google/gemini-3-flash-preview \
  --reflection-model google/gemini-3-exp-ultra-preview \
  --max-calls 1000 \
  --num-train 200 \
  --num-val 100
```
Outputs:

- `best_prompt.txt` - Optimized system prompt
- `pareto_frontier.jsonl` - Best prompts per validation example
- `metadata.json` - Run configuration and summary

To make the optimized prompt your environment's default:

```python
import verifiers as vf

DEFAULT_SYSTEM_PROMPT = """<content from best_prompt.txt>"""

def load_environment(
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    **kwargs,
):
    return vf.SingleTurnEnv(
        dataset=dataset,
        system_prompt=system_prompt,
        rubric=rubric,
    )
```
GEPA Configuration Files
Use TOML configs for reproducible optimization:

`configs/gepa/my-optimization.toml`
RL Best Practices
Before Training
Before training, evaluate your environment and check that rewards vary across rollouts of the same example:

```python
import json
import numpy as np

with open("results.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

# Group rewards by example
examples = {}
for r in rollouts:
    ex = r["example_id"]
    if ex not in examples:
        examples[ex] = []
    examples[ex].append(r["reward"])

# Check reward variance within each example
for ex, rewards in examples.items():
    print(f"Example {ex}: std={np.std(rewards):.3f}, rewards={rewards}")
```
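Examples whose rollouts all receive the same reward provide no learning signal to group-relative advantage estimates, so the fraction of such examples is a useful summary statistic. A self-contained sketch with toy data (in practice, `examples` would be the dict built above):

```python
# `examples` maps example_id -> list of rollout rewards (toy data here).
examples = {
    "ex-1": [0.0, 0.0, 0.0, 0.0],  # degenerate: model always fails
    "ex-2": [0.0, 1.0, 1.0, 0.0],  # informative: mixed outcomes
    "ex-3": [1.0, 1.0, 1.0, 1.0],  # degenerate: already solved
}

# An example with identical rewards across rollouts has zero variance.
degenerate = [ex for ex, rewards in examples.items() if len(set(rewards)) == 1]
frac = len(degenerate) / len(examples)
print(f"{frac:.0%} of examples have zero reward variance: {degenerate}")
```

If a large fraction of your dataset is degenerate in either direction, adjust task difficulty (or enable online difficulty filtering) before spending GPU hours.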
Training Hyperparameters
For More Aggressive Training
⚠️ Higher risk of instability/collapse:

- Increase learning rate: `1e-5` → `1e-4` (LoRA), `1e-6` → `1e-5` (full finetuning)
- Decrease `rollouts_per_example`: `16` → `8`
- Decrease `batch_size`: `512` → `256`
For More Stable Training
✅ Slower progress but safer:

- Increase `rollouts_per_example`: `8` → `16` or `32`
- Increase `batch_size`: `256` → `512` or `1024`
- Use larger models: `4B` → `30B` or `235B`
- Enable difficulty filtering (prime-rl)
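Put together, a conservative prime-rl config fragment might look like this (values are illustrative, following the guidance above):

```toml
# Conservative settings: slower progress, lower collapse risk
learning_rate = 1e-6          # full-finetune range; ~1e-5 for LoRA
rollouts_per_example = 16
batch_size = 512
```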
During Training
Monitor W&B metrics:

- `reward/mean` - Should increase steadily
- `reward/std` - Should remain stable (not collapse to 0)
- `policy/entropy` - Should decrease but not collapse
- `policy/kl` - Should stay within bounds

Warning signs:

- Sudden reward drops
- Loss divergence
- Degenerate outputs (repetition, incoherence)
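These checks are easy to automate on exported metric history. A minimal sketch; the function name and thresholds are illustrative, not part of prime-rl:

```python
def training_health(reward_means, reward_stds, entropies,
                    drop_tol=0.2, std_floor=0.01, entropy_floor=0.05):
    """Flag common failure modes from recent metric history (newest value last)."""
    issues = []
    # Sudden reward drop between the last two logged steps
    if len(reward_means) >= 2 and reward_means[-1] < reward_means[-2] - drop_tol:
        issues.append("sudden reward drop")
    # reward/std collapsing toward 0 means rollouts are becoming identical
    if reward_stds and reward_stds[-1] < std_floor:
        issues.append("reward std collapsed")
    # Entropy near 0 signals a degenerate, near-deterministic policy
    if entropies and entropies[-1] < entropy_floor:
        issues.append("entropy collapsed")
    return issues

print(training_health([0.4, 0.6, 0.3], [0.2, 0.2, 0.0], [1.2, 0.9, 0.01]))
# flags all three failure modes
```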
Common Issues
OOM During Generation
- Reduce `rollouts_per_example`
- Reduce `batch_size`
- Use LoRA instead of full finetuning
- Increase `tensor_parallel_size`
Training Instability
- Decrease learning rate
- Increase `rollouts_per_example` (better advantage estimates)
- Increase `batch_size` (more stable gradients)
- Enable gradient clipping
- Use reward clipping/normalization
Slow Training
- Increase learning rate (if stable)
- Use continuous rewards instead of binary
- Enable online difficulty filtering
- Use appropriate task difficulty
- Check GPU utilization
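On continuous vs. binary rewards: an all-or-nothing reward gives no signal until the model is exactly right, while partial credit rewards intermediate progress. A sketch, where the token-overlap F1 metric is one illustrative choice among many:

```python
def binary_reward(completion: str, answer: str) -> float:
    """All-or-nothing: zero signal until the model matches exactly."""
    return 1.0 if completion.strip() == answer.strip() else 0.0

def continuous_reward(completion: str, answer: str) -> float:
    """Partial credit via token-overlap F1: a denser learning signal."""
    pred, gold = set(completion.split()), set(answer.split())
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(continuous_reward("Paris is the capital", "the capital is Paris"))  # 1.0
print(binary_reward("Paris is the capital", "the capital is Paris"))      # 0.0
```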
Model Collapse
Symptoms: All outputs become identical, entropy → 0

Fixes:

- Restart from earlier checkpoint
- Decrease learning rate
- Increase KL penalty
- Increase entropy bonus
- Increase rollout diversity (temperature, top_p)
Advanced Topics
Multi-Task Training
Train on multiple environments by combining them with `EnvGroup` in your environment.
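As a toy illustration of the idea (this is not the verifiers `EnvGroup` API, just the routing pattern it implements): examples from several tasks are interleaved into one stream, and each completion is scored by the task it came from.

```python
import random

class ToyEnvGroup:
    """Interleaves examples from multiple tasks; routes scoring to the source task."""
    def __init__(self, envs: dict):
        self.envs = envs  # task name -> (examples, reward_fn)

    def sample(self, rng: random.Random):
        # Pick a task, then an example from it; the task name tags the example.
        name = rng.choice(sorted(self.envs))
        examples, _ = self.envs[name]
        return name, rng.choice(examples)

    def score(self, name: str, completion: str, answer: str) -> float:
        # Each task keeps its own reward function.
        _, reward_fn = self.envs[name]
        return reward_fn(completion, answer)

group = ToyEnvGroup({
    "math": ([("2+2=?", "4")], lambda c, a: float(c.strip() == a)),
    "echo": ([("say hi", "hi")], lambda c, a: float(a in c)),
})
name, (prompt, answer) = group.sample(random.Random(0))
print(name, group.score(name, answer, answer))
```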
Curriculum Learning
Progressively increase difficulty as training advances.

Continuous Rewards
Prefer continuous over binary rewards.

Chat Template Issues
Other Trainers
Verifiers environments work with multiple training frameworks.

Tinker
Tinker supports Verifiers via recipes.

SkyRL
SkyRL integrates Verifiers.

rLLM
rLLM supports both verl and Tinker backends.

Next Steps
- Evaluation: Monitor training progress with evaluations → Evaluation Guide
- Environment improvements: Iterate on reward functions and task design
- Scaling: Move from small experiments to full training runs
- Model selection: Experiment with different base models