The prime-rl trainer is a production-ready async RL training framework that supports large-scale multi-node training, agentic rollouts with Verifiers environments, Mixture-of-Experts (MoE) models, LoRA adapters, and training algorithms including SFT and online distillation.
We recommend using prime-rl for training with Verifiers environments on self-managed GPU infrastructure.
Features
The default configuration distills best practices from our research team's experience and the broader community into a stable, easy-to-use recipe:
- Async rollout generation with continuous batching
- Online difficulty filtering to ensure training diversity
- In-flight weight updates for faster convergence
- Importance sampling and logprob clipping for stability
- Multi-node training with distributed data parallelism
- LoRA and full finetuning support
- MoE model support for efficient scaling
- SFT and online distillation in addition to RL
Setup
Install prime-rl
Set up your workspace for training with `prime-rl`. This will:
- Clone and install the `prime-rl` trainer and its dependencies
- Set up a default TOML config for training
- Configure the included `wiki-search` environment for 8 GPUs
Configure your training
Edit the generated config file at `configs/prime-rl/wiki-search.toml`. Key parameters:
- `model` - Model to train (a HuggingFace model ID or local path)
- `max_steps` - Number of training steps
- `batch_size` - Rollouts per training batch
- `rollouts_per_example` - Number of rollouts per dataset example for advantage estimation
- `env.id` - Environment to train on (local or from the Environments Hub)
- `env.args` - Environment-specific arguments passed to `load_environment()`
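Putting these parameters together, a minimal config might look like the following sketch. Only the key names come from the list above; the table layout, the inline `args` keys, and the concrete values are illustrative assumptions, not a verified prime-rl config:

```toml
# Illustrative sketch -- layout and values are assumptions.
model = "Qwen/Qwen2.5-7B-Instruct"  # HuggingFace model ID or local path
max_steps = 500                     # number of training steps
batch_size = 256                    # rollouts per training batch
rollouts_per_example = 8            # rollouts per example for advantage estimation

[env]
id = "wiki-search"                  # the included example environment
args = { max_turns = 10 }           # passed to load_environment(); key is hypothetical
```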
Training Configuration
Model Selection
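The original snippet for this section is not shown here; as a sketch, the `model` key accepts either a Hub model ID or a local checkpoint path (both values below are examples):

```toml
# Either a HuggingFace model ID...
model = "Qwen/Qwen2.5-7B-Instruct"
# ...or a local checkpoint path:
# model = "/checkpoints/my-sft-model"
```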
Environment Configuration
Train on a single environment, or mix several; the `weight` parameter controls the sampling probability for each environment.
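A hedged sketch of both cases. The `weight` parameter comes from the text above; the array-of-tables syntax for multiple environments is an assumption:

```toml
# Single environment:
[env]
id = "wiki-search"

# Multiple environments (syntax is an assumption); `weight` sets
# each environment's sampling probability:
# [[env]]
# id = "wiki-search"
# weight = 0.7
#
# [[env]]
# id = "math-python"
# weight = 0.3
```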
Sampling Configuration
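The original snippet for this section is not shown here. The key names below follow common vLLM-style sampling options and are assumptions, not verified prime-rl keys:

```toml
# Key names are assumptions (vLLM-style sampling options).
[sampling]
temperature = 1.0   # higher values increase rollout diversity
top_p = 1.0
max_tokens = 2048   # cap on generated tokens per rollout
```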
Training Hyperparameters
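The original snippet for this section is not shown here. As a sketch, the hyperparameters named elsewhere in this doc might be set like this (all values illustrative; `micro_batch_size` is named in the OOM section below):

```toml
# Illustrative starting points, not verified defaults.
learning_rate = 1e-5
batch_size = 512
micro_batch_size = 4   # per-device batch; lower it if training OOMs
max_steps = 1000
```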
Online Difficulty Filtering
Ensure training diversity by filtering out rollout groups that provide no learning signal (e.g. groups where every rollout receives the same reward).
Weights & Biases Integration
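Metrics can be sent to Weights & Biases. A hedged sketch of what the logging section of the config might look like; the table and key names here are assumptions, not verified prime-rl keys:

```toml
# Table and key names are assumptions.
[wandb]
project = "prime-rl-experiments"  # hypothetical project name
name = "wiki-search-qwen-7b"      # hypothetical run name
```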
Multi-Node Training
For distributed training across multiple nodes:
- Set up `prime-rl` on each node
- Configure the same training config on all nodes
- Launch with distributed settings
Monitoring Training
Training metrics are logged to Weights & Biases:
- `train/reward` - Average reward per rollout
- `train/loss` - Policy gradient loss
- `train/learning_rate` - Current learning rate
- `train/kl_divergence` - KL divergence from the reference policy
- `rollout/mean_length` - Average rollout length
- `rollout/generation_time` - Time to generate rollouts
Best Practices
Before training, validate your environment with `prime eval run` to ensure:
- Baseline performance is > 0% (the task isn't too hard)
- Baseline performance is < 80% (task isn’t too easy)
- Rewards show diversity across rollouts
For Faster Training
- Increase `learning_rate` (1e-5 to 1e-4 for LoRA)
- Decrease `rollouts_per_example` (4-8)
- Decrease `batch_size` (128-256)
- Use smaller models
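The adjustments above, expressed as a hedged config fragment (top-level key placement, and the model ID, are assumptions):

```toml
# Faster, noisier training -- values from the list above.
learning_rate = 1e-4       # upper end of the suggested LoRA range
rollouts_per_example = 4
batch_size = 128
model = "Qwen/Qwen2.5-1.5B-Instruct"  # a smaller model; example ID
```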
For More Stable Training
- Increase `rollouts_per_example` (16-32)
- Increase `batch_size` (512-1024)
- Use larger models (14B+)
- Enable online difficulty filtering
- Use a KL penalty
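The stability settings above as a hedged config fragment. The `kl_coef` and `filter_zero_advantage` key names are assumptions standing in for the KL penalty and online difficulty filtering options; the other keys come from the lists above:

```toml
# More stable training -- values from the list above.
rollouts_per_example = 16
batch_size = 512
model = "Qwen/Qwen2.5-14B-Instruct"   # 14B+; example ID
kl_coef = 0.01                # KL penalty weight; key name is an assumption
filter_zero_advantage = true  # online difficulty filtering; key name is an assumption
```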
Common Issues
OOM During Generation
- Reduce `rollouts_per_example` or `micro_batch_size`
- Use LoRA instead of full finetuning
- Ensure vLLM server has sufficient memory
Training Instability
- Decrease learning rate
- Increase `rollouts_per_example`
- Increase `batch_size`
- Enable the KL penalty
Slow Training
- Increase learning rate
- Use continuous rewards (not sparse binary rewards)
- Enable online difficulty filtering
- Use easier tasks or smarter models