The prime-rl trainer is a production-ready async RL training framework that supports large-scale multi-node training, agentic rollouts with Verifiers environments, Mixture-of-Experts (MoE) models, LoRA adapters, and training algorithms including SFT and online distillation. We recommend using prime-rl for training with Verifiers environments on self-managed GPU infrastructure.

Features

The default configuration distills best practices from our research team’s experience and the broader community into a stable, easy-to-use recipe:
  • Async rollout generation with continuous batching
  • Online difficulty filtering to ensure training diversity
  • In-flight weight updates for faster convergence
  • Importance sampling and logprob clipping for stability
  • Multi-node training with distributed data parallelism
  • LoRA and full finetuning support
  • MoE model support for efficient scaling
  • SFT and online distillation in addition to RL

Setup

1. Install prime-rl

Set up your workspace for training with prime-rl:
prime lab setup --prime-rl
This will:
  • Clone and install the prime-rl trainer and its dependencies
  • Set up a default TOML config for training
  • Configure the included wiki-search environment for 8 GPUs
2. Configure your training

Edit the generated config file at configs/prime-rl/wiki-search.toml:
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 500
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 512

[[env]]
id = "primeintellect/wiki-search"
args = { max_turns = 5 }

[wandb]
project = "wiki-search"
name = "qwen3-4b-wiki-search"
Key parameters:
  • model - Model to train (can be a HuggingFace model ID or local path)
  • max_steps - Number of training steps
  • batch_size - Rollouts per training batch
  • rollouts_per_example - Number of rollouts generated per dataset example, used for group-based advantage estimation
  • env.id - Environment to train on (local or from Environments Hub)
  • env.args - Environment-specific arguments passed to load_environment()
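The arithmetic relating these parameters can be sketched as follows (a minimal illustration, assuming batch_size counts rollouts rather than dataset examples):

```python
# How batch_size and rollouts_per_example relate (assumed semantics:
# batch_size counts rollouts, not dataset examples).
batch_size = 256          # rollouts per training batch
rollouts_per_example = 8  # rollouts sampled for each dataset example

# Each batch therefore covers this many distinct dataset examples:
examples_per_batch = batch_size // rollouts_per_example  # 32
```

With these defaults, every training step sees 32 distinct examples, each rolled out 8 times so that per-example reward spread can be used for advantage estimation.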
3. Start training

Launch training with:
uv run prime-rl configs/prime-rl/wiki-search.toml
This launches a tmux session with separate panes for:
  • Trainer - Handles gradient updates and optimization
  • Orchestrator - Manages rollout generation and batching
  • Inference server - Serves the model for rollout generation
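The producer/consumer relationship between these processes can be sketched with a toy queue (illustrative only; prime-rl runs them as separate processes in tmux, not threads):

```python
import queue
import threading

# Toy sketch of the orchestrator/trainer split: the orchestrator keeps
# generating rollout batches ahead of the trainer, which consumes them
# as soon as they are ready.
rollout_queue = queue.Queue(maxsize=4)
consumed = []

def orchestrator():
    # Generates rollouts (via the inference server in the real system).
    for step in range(3):
        rollout_queue.put(f"rollout-batch-{step}")

def trainer():
    # Blocks until a batch of rollouts is ready, then runs a gradient update.
    for _ in range(3):
        consumed.append(rollout_queue.get())

t1, t2 = threading.Thread(target=orchestrator), threading.Thread(target=trainer)
t1.start(); t2.start(); t1.join(); t2.join()
# consumed == ["rollout-batch-0", "rollout-batch-1", "rollout-batch-2"]
```

The bounded queue is what makes the pipeline async: generation never stalls waiting for the optimizer, and the trainer never idles while rollouts exist.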

Training Configuration

Model Selection

model = "Qwen/Qwen3-4B-Instruct-2507"
Supports any HuggingFace model or local checkpoint. For LoRA training:
[lora]
enabled = true
r = 64
alpha = 16
dropout = 0.05
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"]
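To see why LoRA is so much cheaper than full finetuning, here is a back-of-the-envelope parameter count for the config above. The model dimensions are assumed values for illustration, not Qwen3-4B's actual shapes:

```python
# Back-of-the-envelope estimate of trainable LoRA parameters.
# hidden_size and num_layers are assumed, illustrative values.
r = 64
hidden_size = 2560   # assumed
num_layers = 36      # assumed
num_targets = 4      # q_proj, k_proj, v_proj, o_proj

# Each adapted projection gains two low-rank factors: A (r x d_in)
# and B (d_out x r). Assuming square projections (d_in = d_out):
params_per_projection = 2 * r * hidden_size
total = params_per_projection * num_targets * num_layers
# roughly 47M trainable parameters against ~4B frozen ones
```

Even at r = 64, the adapters are on the order of 1% of the base model, which is why LoRA runs fit on far less GPU memory.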

Environment Configuration

Train on a single environment:
[[env]]
id = "primeintellect/math-python"
args = { max_turns = 10, difficulty = "hard" }
Or train on multiple environments:
[[env]]
id = "primeintellect/math-python"
args = { max_turns = 10 }
weight = 0.5

[[env]]
id = "primeintellect/gsm8k"
args = { max_turns = 5 }
weight = 0.3

[[env]]
id = "primeintellect/wiki-search"
args = { max_turns = 8 }
weight = 0.2
The weight parameter controls the sampling probability for each environment.
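One way to picture the weights is as a categorical sampling distribution over environments when building each batch (an illustrative sketch, not necessarily prime-rl's actual scheduler):

```python
import random

# Sketch of weight-proportional environment sampling, mirroring the
# [[env]] weights in the config above.
env_weights = {
    "primeintellect/math-python": 0.5,
    "primeintellect/gsm8k": 0.3,
    "primeintellect/wiki-search": 0.2,
}

def pick_env(rng):
    ids = list(env_weights)
    return rng.choices(ids, weights=[env_weights[i] for i in ids], k=1)[0]

rng = random.Random(0)
counts = {env: 0 for env in env_weights}
for _ in range(10_000):
    counts[pick_env(rng)] += 1
# math-python is drawn about half the time, wiki-search about a fifth
```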

Sampling Configuration

[sampling]
max_tokens = 512
temperature = 0.7
top_p = 0.9
stop = ["<|endoftext|>"]

Training Hyperparameters

learning_rate = 1e-5  # 1e-5 for LoRA, 1e-6 for full finetuning
max_steps = 1000
batch_size = 256
rollouts_per_example = 8
gradient_accumulation_steps = 1
warmup_steps = 100

Online Difficulty Filtering

Ensure training diversity by filtering rollout groups:
[difficulty_filter]
enabled = true
min_reward_variance = 0.1  # Require some diversity in rewards
max_reward = 0.95  # Skip groups that are too easy
min_reward = 0.05  # Skip groups that are too hard
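The filter's logic can be sketched at the rollout-group level (illustrative, using the thresholds from the config above; not prime-rl's internal code):

```python
from statistics import mean, pvariance

# Thresholds from the [difficulty_filter] config above.
MIN_REWARD_VARIANCE = 0.1
MAX_REWARD = 0.95
MIN_REWARD = 0.05

def keep_group(rewards):
    """Keep a rollout group only if it is neither trivial nor hopeless."""
    m = mean(rewards)
    if m > MAX_REWARD or m < MIN_REWARD:
        return False  # solved (or failed) nearly every time: no signal
    return pvariance(rewards) >= MIN_REWARD_VARIANCE

keep_group([1.0, 1.0, 1.0, 1.0])  # False: too easy, all rollouts succeed
keep_group([1.0, 0.0, 1.0, 0.0])  # True: mixed outcomes carry gradient signal
```

Groups where every rollout gets the same reward produce zero advantage, so dropping them spends compute only on examples the model can still learn from.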

Weights & Biases Integration

[wandb]
project = "my-training-runs"
name = "qwen3-4b-math-rl"
entity = "my-team"  # optional

Multi-Node Training

For distributed training across multiple nodes:
  1. Set up prime-rl on each node
  2. Configure the same training config on all nodes
  3. Launch with distributed settings:
# Node 0 (master)
export MASTER_ADDR=node0.example.com
export MASTER_PORT=29500
export WORLD_SIZE=4
export RANK=0
uv run prime-rl configs/prime-rl/wiki-search.toml

# Node 1
export MASTER_ADDR=node0.example.com
export MASTER_PORT=29500
export WORLD_SIZE=4
export RANK=1
uv run prime-rl configs/prime-rl/wiki-search.toml

# ... and so on for nodes 2-3

Monitoring Training

Training metrics are logged to Weights & Biases:
  • train/reward - Average reward per rollout
  • train/loss - Policy gradient loss
  • train/learning_rate - Current learning rate
  • train/kl_divergence - KL divergence from reference policy
  • rollout/mean_length - Average rollout length
  • rollout/generation_time - Time to generate rollouts

Best Practices

Before training, validate your environment with prime eval run to ensure:
  • Baseline performance is > 0% (task isn’t too hard)
  • Baseline performance is < 80% (task isn’t too easy)
  • Rewards show diversity across rollouts

For Faster Training

  • Increase learning_rate (1e-5 to 1e-4 for LoRA)
  • Decrease rollouts_per_example (4-8)
  • Decrease batch_size (128-256)
  • Use smaller models

For More Stable Training

  • Increase rollouts_per_example (16-32)
  • Increase batch_size (512-1024)
  • Use larger models (14B+)
  • Enable online difficulty filtering
  • Use KL penalty:
[kl_penalty]
enabled = true
target_kl = 0.01
beta = 0.1
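The role of beta can be sketched as a penalty term added to the policy loss (an assumed form for illustration; prime-rl's exact KL estimator may differ):

```python
# Sketch of folding a KL penalty into the policy loss.
BETA = 0.1  # from the [kl_penalty] config above

def penalized_loss(pg_loss, logprob_policy, logprob_ref):
    # Simple per-sample KL estimate: log pi(a) - log pi_ref(a),
    # averaged over actions drawn from the current policy.
    kl_estimate = logprob_policy - logprob_ref
    return pg_loss + BETA * kl_estimate
```

A larger beta pulls the policy harder toward the reference model, trading learning speed for stability; target_kl gives the controller a setpoint for how far the policy may drift.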

Common Issues

OOM During Generation

  • Reduce rollouts_per_example or micro_batch_size
  • Use LoRA instead of full finetuning
  • Ensure vLLM server has sufficient memory

Training Instability

  • Decrease learning rate
  • Increase rollouts_per_example
  • Increase batch_size
  • Enable KL penalty

Slow Training

  • Increase learning rate
  • Use continuous rewards (not sparse binary rewards)
  • Enable online difficulty filtering
  • Use easier tasks or smarter models

Further Documentation

For advanced configuration options and troubleshooting, see the prime-rl documentation.
