The prime-rl trainer is a production-ready async RL training framework that supports large-scale multi-node training, agentic rollouts with Verifiers environments, Mixture-of-Experts (MoE) models, LoRA adapters, and training algorithms including SFT and online distillation. We recommend using prime-rl for training with Verifiers environments on self-managed GPU infrastructure.

Features

The default configuration distills best practices from our research team’s experience and the broader community into a stable, easy-to-use recipe:
  • Async rollout generation with continuous batching
  • Online difficulty filtering to ensure training diversity
  • In-flight weight updates for faster convergence
  • Importance sampling and logprob clipping for stability
  • Multi-node training with distributed data parallelism
  • LoRA and full finetuning support
  • MoE model support for efficient scaling
  • SFT and online distillation in addition to RL

Setup

1. Install prime-rl

Set up your workspace for training with prime-rl:
prime lab setup --prime-rl
This will:
  • Clone and install the prime-rl trainer and its dependencies
  • Set up a default TOML config for training
  • Configure the included wiki-search environment for 8 GPUs
2. Configure your training

Edit the generated config file at configs/prime-rl/wiki-search.toml:
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 500
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 512

[[env]]
id = "primeintellect/wiki-search"
args = { max_turns = 5 }

[wandb]
project = "wiki-search"
name = "qwen3-4b-wiki-search"
Key parameters:
  • model - Model to train (can be a HuggingFace model ID or local path)
  • max_steps - Number of training steps
  • batch_size - Rollouts per training batch
  • rollouts_per_example - Number of rollouts generated per dataset example, used for group-based advantage estimation
  • env.id - Environment to train on (local or from Environments Hub)
  • env.args - Environment-specific arguments passed to load_environment()
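The arithmetic relating these parameters can be sketched as follows (a minimal illustration, assuming batch_size counts rollouts rather than dataset examples):

```python
# How batch_size and rollouts_per_example relate (assumed semantics:
# batch_size counts rollouts, not dataset examples).
batch_size = 256          # rollouts per training batch
rollouts_per_example = 8  # rollouts sampled for each dataset example

# Each batch therefore covers this many distinct dataset examples:
examples_per_batch = batch_size // rollouts_per_example  # 32
```

With these defaults, every training step sees 32 distinct examples, each rolled out 8 times so that per-example reward spread can be used for advantage estimation.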
3. Start training

Launch training with:
uv run prime-rl configs/prime-rl/wiki-search.toml
This launches a tmux session with separate panes for:
  • Trainer - Handles gradient updates and optimization
  • Orchestrator - Manages rollout generation and batching
  • Inference server - Serves the model for rollout generation
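The producer/consumer relationship between these processes can be sketched with a toy queue (illustrative only; prime-rl runs them as separate processes in tmux, not threads):

```python
import queue
import threading

# Toy sketch of the orchestrator/trainer split: the orchestrator keeps
# generating rollout batches ahead of the trainer, which consumes them
# as soon as they are ready.
rollout_queue = queue.Queue(maxsize=4)
consumed = []

def orchestrator():
    # Generates rollouts (via the inference server in the real system).
    for step in range(3):
        rollout_queue.put(f"rollout-batch-{step}")

def trainer():
    # Blocks until a batch of rollouts is ready, then runs a gradient update.
    for _ in range(3):
        consumed.append(rollout_queue.get())

t1, t2 = threading.Thread(target=orchestrator), threading.Thread(target=trainer)
t1.start(); t2.start(); t1.join(); t2.join()
# consumed == ["rollout-batch-0", "rollout-batch-1", "rollout-batch-2"]
```

The bounded queue is what makes the pipeline async: generation never stalls waiting for the optimizer, and the trainer never idles while rollouts exist.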

Training Configuration

Model Selection

model = "Qwen/Qwen3-4B-Instruct-2507"
Supports any HuggingFace model or local checkpoint. For LoRA training:
[lora]
enabled = true
r = 64
alpha = 16
dropout = 0.05
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"]
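To see why LoRA is so much cheaper than full finetuning, here is a back-of-the-envelope parameter count for the config above. The model dimensions are assumed values for illustration, not Qwen3-4B's actual shapes:

```python
# Back-of-the-envelope estimate of trainable LoRA parameters.
# hidden_size and num_layers are assumed, illustrative values.
r = 64
hidden_size = 2560   # assumed
num_layers = 36      # assumed
num_targets = 4      # q_proj, k_proj, v_proj, o_proj

# Each adapted projection gains two low-rank factors: A (r x d_in)
# and B (d_out x r). Assuming square projections (d_in = d_out):
params_per_projection = 2 * r * hidden_size
total = params_per_projection * num_targets * num_layers
# roughly 47M trainable parameters against ~4B frozen ones
```

Even at r = 64, the adapters are on the order of 1% of the base model, which is why LoRA runs fit on far less GPU memory.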

Environment Configuration

Train on a single environment:
[[env]]
id = "primeintellect/math-python"
args = { max_turns = 10, difficulty = "hard" }
Or train on multiple environments:
[[env]]
id = "primeintellect/math-python"
args = { max_turns = 10 }
weight = 0.5

[[env]]
id = "primeintellect/gsm8k"
args = { max_turns = 5 }
weight = 0.3

[[env]]
id = "primeintellect/wiki-search"
args = { max_turns = 8 }
weight = 0.2
The weight parameter controls the sampling probability for each environment.
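One way to picture the weights is as a categorical sampling distribution over environments when building each batch (an illustrative sketch, not necessarily prime-rl's actual scheduler):

```python
import random

# Sketch of weight-proportional environment sampling, mirroring the
# [[env]] weights in the config above.
env_weights = {
    "primeintellect/math-python": 0.5,
    "primeintellect/gsm8k": 0.3,
    "primeintellect/wiki-search": 0.2,
}

def pick_env(rng):
    ids = list(env_weights)
    return rng.choices(ids, weights=[env_weights[i] for i in ids], k=1)[0]

rng = random.Random(0)
counts = {env: 0 for env in env_weights}
for _ in range(10_000):
    counts[pick_env(rng)] += 1
# math-python is drawn about half the time, wiki-search about a fifth
```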

Sampling Configuration

[sampling]
max_tokens = 512
temperature = 0.7
top_p = 0.9
stop = ["<|endoftext|>"]

Training Hyperparameters

learning_rate = 1e-5  # 1e-5 for LoRA, 1e-6 for full finetuning
max_steps = 1000
batch_size = 256
rollouts_per_example = 8
gradient_accumulation_steps = 1
warmup_steps = 100

Online Difficulty Filtering

Ensure training diversity by filtering rollout groups:
[difficulty_filter]
enabled = true
min_reward_variance = 0.1  # Require some diversity in rewards
max_reward = 0.95  # Skip groups that are too easy
min_reward = 0.05  # Skip groups that are too hard
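The filter's logic can be sketched at the rollout-group level (illustrative, using the thresholds from the config above; not prime-rl's internal code):

```python
from statistics import mean, pvariance

# Thresholds from the [difficulty_filter] config above.
MIN_REWARD_VARIANCE = 0.1
MAX_REWARD = 0.95
MIN_REWARD = 0.05

def keep_group(rewards):
    """Keep a rollout group only if it is neither trivial nor hopeless."""
    m = mean(rewards)
    if m > MAX_REWARD or m < MIN_REWARD:
        return False  # solved (or failed) nearly every time: no signal
    return pvariance(rewards) >= MIN_REWARD_VARIANCE

keep_group([1.0, 1.0, 1.0, 1.0])  # False: too easy, all rollouts succeed
keep_group([1.0, 0.0, 1.0, 0.0])  # True: mixed outcomes carry gradient signal
```

Groups where every rollout gets the same reward produce zero advantage, so dropping them spends compute only on examples the model can still learn from.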

Weights & Biases Integration

[wandb]
project = "my-training-runs"
name = "qwen3-4b-math-rl"
entity = "my-team"  # optional

Multi-Node Training

For distributed training across multiple nodes:
  1. Set up prime-rl on each node
  2. Configure the same training config on all nodes
  3. Launch with distributed settings:
# Node 0 (master)
export MASTER_ADDR=node0.example.com
export MASTER_PORT=29500
export WORLD_SIZE=4
export RANK=0
uv run prime-rl configs/prime-rl/wiki-search.toml

# Node 1
export MASTER_ADDR=node0.example.com
export MASTER_PORT=29500
export WORLD_SIZE=4
export RANK=1
uv run prime-rl configs/prime-rl/wiki-search.toml

# ... and so on for nodes 2-3

Monitoring Training

Training metrics are logged to Weights & Biases:
  • train/reward - Average reward per rollout
  • train/loss - Policy gradient loss
  • train/learning_rate - Current learning rate
  • train/kl_divergence - KL divergence from reference policy
  • rollout/mean_length - Average rollout length
  • rollout/generation_time - Time to generate rollouts

Best Practices

Before training, validate your environment with prime eval run to ensure:
  • Baseline performance is > 0% (task isn’t too hard)
  • Baseline performance is < 80% (task isn’t too easy)
  • Rewards show diversity across rollouts

For Faster Training

  • Increase learning_rate (1e-5 to 1e-4 for LoRA)
  • Decrease rollouts_per_example (4-8)
  • Decrease batch_size (128-256)
  • Use smaller models

For More Stable Training

  • Increase rollouts_per_example (16-32)
  • Increase batch_size (512-1024)
  • Use larger models (14B+)
  • Enable online difficulty filtering
  • Use KL penalty:
[kl_penalty]
enabled = true
target_kl = 0.01
beta = 0.1
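The role of beta can be sketched as a penalty term added to the policy loss (an assumed form for illustration; prime-rl's exact KL estimator may differ):

```python
# Sketch of folding a KL penalty into the policy loss.
BETA = 0.1  # from the [kl_penalty] config above

def penalized_loss(pg_loss, logprob_policy, logprob_ref):
    # Simple per-sample KL estimate: log pi(a) - log pi_ref(a),
    # averaged over actions drawn from the current policy.
    kl_estimate = logprob_policy - logprob_ref
    return pg_loss + BETA * kl_estimate
```

A larger beta pulls the policy harder toward the reference model, trading learning speed for stability; target_kl gives the controller a setpoint for how far the policy may drift.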

Common Issues

OOM During Generation

  • Reduce rollouts_per_example or micro_batch_size
  • Use LoRA instead of full finetuning
  • Ensure vLLM server has sufficient memory

Training Instability

  • Decrease learning rate
  • Increase rollouts_per_example
  • Increase batch_size
  • Enable KL penalty

Slow Training

  • Increase learning rate
  • Use continuous rewards (not sparse binary rewards)
  • Enable online difficulty filtering
  • Use easier tasks or smarter models

Further Documentation

For advanced configuration options and troubleshooting, see the prime-rl documentation.
