rLLM provides a high-level AgentTrainer class that orchestrates the complete RL training loop. It integrates trajectory generation (via execution engines) with policy optimization (via verl) and distributed execution (via Ray).
Overview
The AgentTrainer simplifies RL training by providing:
- Simple API: specify agent, environment, and datasets, then call train()
- Multiple backends: supports verl (default), Fireworks, and Tinker
- Distributed training: automatic Ray cluster management
- Flexible configuration: Hydra-based config system
- Algorithm support: PPO, GRPO, ReMax, and more
Source code: rllm/trainer/agent_trainer.py:7
The Training Loop
rLLM implements the standard online RL loop:
1. Trajectory generation: the execution engine generates a batch of trajectories using the current agent policy.
2. Advantage computation: advantages are computed from rewards (GRPO, PPO, etc.).
3. Policy update: the verl trainer updates the model weights using the trajectories.
4. Iteration: a new batch is generated with the updated model, and the cycle repeats.
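The four phases above can be sketched as a toy loop in plain Python. This is not rLLM code: `generate_trajectories`, `compute_advantages`, and `update_policy` are hypothetical stand-ins for the execution engine, the advantage estimator, and the verl update.

```python
import random

def generate_trajectories(policy, batch_size):
    # Stand-in for the execution engine: roll out the current policy.
    return [{"reward": random.random() * policy["skill"]} for _ in range(batch_size)]

def compute_advantages(trajectories):
    # Stand-in for GRPO/PPO: center each reward on the batch mean.
    mean_r = sum(t["reward"] for t in trajectories) / len(trajectories)
    for t in trajectories:
        t["advantage"] = t["reward"] - mean_r
    return trajectories

def update_policy(policy, trajectories):
    # Stand-in for the verl update: nudge the policy toward higher reward.
    policy["skill"] += 0.1 * sum(t["advantage"] * t["reward"] for t in trajectories)
    return policy

policy = {"skill": 1.0}
for step in range(3):                                     # 4. iterate with the updated model
    batch = generate_trajectories(policy, batch_size=8)   # 1. trajectory generation
    batch = compute_advantages(batch)                     # 2. advantage computation
    policy = update_policy(policy, batch)                 # 3. policy update
```

The real loop differs mainly in scale: trajectories are token sequences, the policy is an LLM, and each phase runs distributed over Ray.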
Basic Usage
Minimal Training Script
```python
import hydra
from omegaconf import DictConfig

from rllm.trainer import AgentTrainer
from rllm.data import DatasetRegistry
from rllm.agents import ToolAgent
from rllm.environments import ToolEnvironment
from rllm.rewards import math_reward_fn


@hydra.main(config_path="pkg://rllm.trainer.config", config_name="ppo_trainer", version_base=None)
def main(config: DictConfig):
    # Load datasets
    train_dataset = DatasetRegistry.load_dataset("math", "train")
    val_dataset = DatasetRegistry.load_dataset("math", "test")

    # Configure agent and environment
    agent_args = {
        "tools": ["python"],
        "parser_name": "qwen",
        "system_prompt": "Solve the math problem step by step.",
    }
    env_args = {
        "tools": ["python"],
        "reward_fn": math_reward_fn,
        "max_turns": 5,
    }

    # Create trainer
    trainer = AgentTrainer(
        agent_class=ToolAgent,
        env_class=ToolEnvironment,
        agent_args=agent_args,
        env_args=env_args,
        config=config,
        train_dataset=train_dataset,
        val_dataset=val_dataset,
        backend="verl",  # default
    )

    # Train!
    trainer.train()


if __name__ == "__main__":
    main()
```
Workflow-Based Training
For complex multi-agent scenarios:
```python
import hydra
from omegaconf import DictConfig

from rllm.trainer import AgentTrainer
from rllm.data import DatasetRegistry
from rllm.rewards import math_reward_fn
from rllm.workflows import SolverJudgeWorkflow


@hydra.main(config_path="pkg://rllm.trainer.config", config_name="ppo_trainer", version_base=None)
def main(config: DictConfig):
    train_dataset = DatasetRegistry.load_dataset("math", "train")
    val_dataset = DatasetRegistry.load_dataset("math", "test")

    workflow_args = {
        "n_solutions": 4,
        "reward_function": math_reward_fn,
        "solver_agent_cls": MathAgent,  # user-defined agent classes
        "judge_agent_cls": JudgeAgent,
    }

    trainer = AgentTrainer(
        workflow_class=SolverJudgeWorkflow,
        workflow_args=workflow_args,
        config=config,
        train_dataset=train_dataset,
        val_dataset=val_dataset,
    )
    trainer.train()
```
Source code: rllm/trainer/agent_trainer.py:17-89
Configuration System
rLLM uses Hydra for configuration management. Configs are located in rllm/trainer/config/.
Loading Configs
```python
@hydra.main(config_path="pkg://rllm.trainer.config", config_name="ppo_trainer", version_base=None)
def main(config: DictConfig):
    # config is automatically loaded and merged
    pass
```
Overriding Configs
You can override configs in three ways:
Command line:

```bash
python train.py \
    data.train_batch_size=64 \
    trainer.total_epochs=10 \
    actor_rollout_ref.model.path="Qwen/Qwen3-4B"
```

Config file: create a config.yaml:

```yaml
data:
  train_batch_size: 64
  max_prompt_length: 2048
  max_response_length: 1024
trainer:
  total_epochs: 10
  save_freq: 100
```

Then run:

```bash
python train.py --config-path=/path/to/config
```

Programmatic:

```python
from omegaconf import OmegaConf

# Load the base config
config = OmegaConf.load("config.yaml")

# Override specific values
config.data.train_batch_size = 64
config.trainer.total_epochs = 10

# Pass to the trainer
trainer = AgentTrainer(
    agent_class=MyAgent,
    env_class=MyEnv,
    config=config,
    ...
)
```
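Under the hood, a dotted override like `data.train_batch_size=64` is just a path into the nested config. The toy parser below (stdlib only, not Hydra's actual implementation) shows the mechanics; the value-parsing rules are deliberately simplified.

```python
def apply_override(config: dict, override: str) -> None:
    """Apply a single Hydra-style 'a.b.c=value' override to a nested dict."""
    dotted, raw = override.split("=", 1)
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    # Crude literal parsing: try int, then float, else keep the string.
    try:
        value = int(raw)
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            value = raw
    node[leaf] = value

config = {"data": {"train_batch_size": 256}, "trainer": {}}
apply_override(config, "data.train_batch_size=64")
apply_override(config, "actor_rollout_ref.model.path=Qwen/Qwen3-4B")
```

Hydra itself adds quoting, type checking against the schema, list syntax, and much more; this sketch only illustrates the dotted-path idea.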
Key Configuration Sections
data - Dataset configuration
```yaml
data:
  train_batch_size: 256      # Batch size for training
  val_batch_size: 256        # Batch size for validation
  max_prompt_length: 2048    # Max prompt tokens
  max_response_length: 1024  # Max response tokens
  train_files: null          # Auto-set from train_dataset
  val_files: null            # Auto-set from val_dataset
```
trainer - Training parameters
```yaml
trainer:
  total_epochs: 15                 # Total training epochs
  save_freq: 100                   # Save checkpoint every N steps
  test_freq: 100                   # Run validation every N steps
  project_name: "rllm_training"    # W&B project name
  experiment_name: "math_agent"    # Experiment name
  logger: "wandb"                  # Logger backend (wandb/tensorboard)
  val_before_train: true           # Validate before training starts
```
actor_rollout_ref - Model and rollout config
```yaml
actor_rollout_ref:
  model:
    path: "Qwen/Qwen3-4B"            # Model path/name
    enable_gradient_checkpointing: true
  rollout:
    mode: "async"                    # Rollout mode (async/sync)
    n: 8                             # Rollouts per task
    temperature: 0.6                 # Sampling temperature
    log_prob_micro_batch_size: 64    # Batch size for logprob computation
  hybrid_engine: true                # Use hybrid engine (vLLM + PyTorch)
```
algorithm - RL algorithm settings
```yaml
algorithm:
  advantage:
    kl_ctrl:
      type: "grpo"            # Advantage type (grpo/ppo/remax)
      coeff: 0.05             # KL coefficient
  ppo_mini_batch_size: 256    # PPO mini-batch size
  ppo_epochs: 1               # PPO epochs per update
  entropy_coeff: 0.0          # Entropy bonus coefficient
  clip_ratio: 0.2             # PPO clip ratio
```
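The clip_ratio setting controls PPO's clipped surrogate objective: for probability ratio r and advantage A, the per-sample objective is min(r·A, clip(r, 1-ε, 1+ε)·A). A stdlib sketch of that formula (illustrative only, not rLLM's loss code):

```python
def ppo_clip_objective(ratio: float, advantage: float, clip_ratio: float = 0.2) -> float:
    """PPO clipped surrogate objective for one sample (to be maximized)."""
    clipped = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, gains from pushing the ratio past 1 + eps are clipped:
obj = ppo_clip_objective(ratio=1.5, advantage=2.0)  # min(3.0, 1.2 * 2.0) = 2.4
```

The clipping caps how far a single update can move the policy away from the one that generated the trajectories, which is why clip_ratio is one of the more sensitive hyperparameters.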
rllm - rLLM-specific settings
```yaml
rllm:
  agent:
    max_steps: 10              # Max steps per trajectory
    trajectory_timeout: 300    # Timeout per trajectory (seconds)
  engine_args:
    n_parallel_agents: 256     # Parallel agent-env pairs
  stepwise_advantage:
    enable: false              # Use step-level advantages
  compact_filtering:
    enable: true               # Filter invalid trajectories
    mask_timeout: true         # Mask timeout trajectories
    mask_error: true           # Mask error trajectories
  workflow:
    use_workflow: false        # Use AgentWorkflowEngine
```
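Conceptually, compact filtering drops trajectories whose status marks them as unusable before the policy update. A stdlib sketch of the idea; the `status` field and `filter_trajectories` helper are hypothetical, not rLLM's actual implementation:

```python
def filter_trajectories(trajectories, mask_timeout=True, mask_error=True):
    """Keep only trajectories whose status is acceptable for the policy update."""
    masked = set()
    if mask_timeout:
        masked.add("timeout")
    if mask_error:
        masked.add("error")
    return [t for t in trajectories if t["status"] not in masked]

batch = [
    {"status": "ok", "reward": 1.0},
    {"status": "timeout", "reward": 0.0},
    {"status": "error", "reward": 0.0},
]
kept = filter_trajectories(batch)  # only the "ok" trajectory survives
```

Masking timeouts and errors keeps malformed rollouts from contributing misleading zero-reward gradients.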
Training Backends
rLLM supports multiple training backends:
verl (Default)
Best for most use cases. Supports both agent-env and workflow-based training:
```python
trainer = AgentTrainer(
    agent_class=MyAgent,
    env_class=MyEnv,
    config=config,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    backend="verl",  # default
)
trainer.train()
```
Features:
- Full PPO/GRPO support
- Distributed training via Ray
- Hybrid engine (vLLM + PyTorch)
- Advanced advantage computation

Source code: rllm/trainer/agent_trainer.py:123-155
Fireworks
Optimized for Fireworks API with workflow-based training:
```python
trainer = AgentTrainer(
    workflow_class=MyWorkflow,
    workflow_args={...},
    config=config,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    backend="fireworks",
)
trainer.train()
```

Note: the Fireworks backend only supports workflow-based training, not agent_class/env_class.

Source code: rllm/trainer/agent_trainer.py:157-181
Tinker
For Megatron-based training (deprecated):
```python
trainer = AgentTrainer(
    agent_class=MyAgent,
    env_class=MyEnv,
    config=config,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    backend="tinker",
)
trainer.train()
```

Source code: rllm/trainer/agent_trainer.py:98-121
Under the Hood: verl Training
Let’s examine what happens during verl training:
AgentPPOTrainer
The core trainer class that orchestrates the RL loop:
```python
class AgentPPOTrainer(RayPPOTrainer):
    """PPO trainer for agents with environments."""

    def __init__(
        self,
        config,
        tokenizer,
        role_worker_mapping,
        resource_pool_manager,
        env_class,
        agent_class,
        env_args,
        agent_args,
        **kwargs,
    ):
        super().__init__(...)
        self.env_class = env_class
        self.agent_class = agent_class

        # Initialize AgentExecutionEngine
        self.agent_execution_engine = AsyncAgentExecutionEngine(
            rollout_engine=self.async_rollout_manager,
            config=self.config,
            engine_name="verl",
            tokenizer=self.tokenizer,
            max_steps=self.config.rllm.agent.max_steps,
            agent_class=self.agent_class,
            env_class=self.env_class,
            ...
        )
```

Source code: rllm/trainer/verl/agent_ppo_trainer.py:33-88
Training Loop
The fit_agent() method runs the complete training loop:
```python
def fit_agent(self):
    """Main training loop (simplified excerpt)."""
    # Load checkpoint if it exists
    self._load_checkpoint()

    # Validation before training
    if self.val_reward_fn and self.config.trainer.val_before_train:
        val_metrics = self._validate_agent()
        logger.log(val_metrics, step=0)

    for epoch in range(self.config.trainer.total_epochs):
        for batch_idx, batch in enumerate(train_dataloader):
            # 1. Generate trajectories (the async generator is driven
            #    by the trainer's event loop in the actual code)
            envs, agents = self.init_envs_and_agents(batch)
            self.agent_execution_engine.update_envs_and_agents(envs, agents)
            rollout_data = []
            async for traj in self.agent_execution_engine.trajectory_generator():
                rollout_data.append(traj)

            # 2. Compute advantages
            rollout_data = self.compute_advantages(rollout_data)

            # 3. Update policy
            metrics = self.update_policy(rollout_data)

            # 4. Log metrics
            logger.log(metrics, step=self.global_steps)
            self.global_steps += 1

            # 5. Save checkpoint
            if self.global_steps % self.config.trainer.save_freq == 0:
                self._save_checkpoint()

            # 6. Validation
            if self.global_steps % self.config.trainer.test_freq == 0:
                val_metrics = self._validate_agent()
                logger.log(val_metrics, step=self.global_steps)
```

Source code: rllm/trainer/verl/agent_ppo_trainer.py:126-315
Advantage Computation
rLLM supports multiple advantage computation methods:
```python
def compute_advantages(self, rollout_data):
    """Compute advantages for trajectories."""
    if self.config.algorithm.advantage.kl_ctrl.type == "grpo":
        # Group Relative Policy Optimization:
        # compare trajectories within the same task
        advantages = compute_grpo_advantages(
            rollout_data,
            baseline="mean",  # or "max"
        )
    elif self.config.algorithm.advantage.kl_ctrl.type == "ppo":
        # Proximal Policy Optimization:
        # use the value network as a baseline
        advantages = compute_ppo_advantages(
            rollout_data,
            value_network=self.critic,
            gamma=self.config.algorithm.gamma,
            lambda_=self.config.algorithm.gae_lambda,
        )
    return advantages
```
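Both branches correspond to standard formulas: GRPO normalizes each trajectory's reward against its task group (A_i = (r_i - mean) / std in the common formulation), while PPO typically uses Generalized Advantage Estimation (A_t = delta_t + gamma * lambda * A_{t+1}, with delta_t = r_t + gamma * V_{t+1} - V_t). A stdlib sketch of both, as an illustration rather than rLLM's actual implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards within one task's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` carries one extra entry: the bootstrap value of the final state.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

group = grpo_advantages([1.0, 0.0, 1.0, 0.0])             # zero-mean within the group
traj = gae_advantages([0.0, 0.0, 1.0], [0.1, 0.2, 0.5, 0.0])
```

GRPO's appeal is that it needs no value network: the group of n rollouts per task (the rollout.n setting) serves as its own baseline.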
See RL Algorithms for detailed explanations.
Datasets
rLLM uses the DatasetRegistry for managing training data:
```python
from rllm.data import DatasetRegistry, Dataset

# Load pre-registered datasets
train_dataset = DatasetRegistry.load_dataset("math", "train")
val_dataset = DatasetRegistry.load_dataset("math", "test")

# Or create a custom dataset
custom_data = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 3+3?", "answer": "6"},
]
train_dataset = Dataset.from_list(custom_data, name="custom_math")

# Pass to the trainer
trainer = AgentTrainer(
    agent_class=MyAgent,
    env_class=MyEnv,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    config=config,
)
```
Distributed Training
rLLM uses Ray for distributed training:
Ray Configuration
```yaml
ray_init:
  address: null              # Ray cluster address (null for local)
  num_cpus: null             # Number of CPUs (null for auto-detect)
  num_gpus: null             # Number of GPUs (null for auto-detect)
  object_store_memory: null  # Object store memory (null for default)
```
Multi-Node Training
```bash
# On the head node:
ray start --head --port=6379

# On worker nodes:
ray start --address=<head-node-ip>:6379
```

In the training script:

```python
@hydra.main(config_path="pkg://rllm.trainer.config", config_name="ppo_trainer", version_base=None)
def main(config: DictConfig):
    # Override the Ray address
    config.ray_init.address = "ray://<head-node-ip>:10001"
    trainer = AgentTrainer(
        agent_class=MyAgent,
        env_class=MyEnv,
        config=config,
        ...
    )
    trainer.train()
```
Monitoring and Logging
rLLM supports multiple logging backends:
Weights & Biases
```yaml
trainer:
  logger: "wandb"
  project_name: "rllm_training"
  experiment_name: "math_agent_v1"
```
TensorBoard
```yaml
trainer:
  logger: "tensorboard"
  project_name: "rllm_training"
```
Logged Metrics
The trainer automatically logs:
- Rewards: mean/std/min/max trajectory rewards
- Success rate: percentage of successful trajectories
- Episode length: mean/std trajectory length
- Training metrics: policy loss, value loss, entropy, KL divergence
- Timing: rollout time, training time, total time
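As an illustration of how the reward statistics in that list might be aggregated from a batch, here is a stdlib sketch; the `reward` field, metric key names, and the reward-positive success criterion are assumptions, not rLLM's exact scheme:

```python
import statistics

def reward_metrics(trajectories):
    """Aggregate per-batch reward statistics of the kind the trainer logs."""
    rewards = [t["reward"] for t in trajectories]
    return {
        "reward/mean": statistics.fmean(rewards),
        "reward/std": statistics.pstdev(rewards),
        "reward/min": min(rewards),
        "reward/max": max(rewards),
        "success_rate": sum(r > 0 for r in rewards) / len(rewards),
    }

metrics = reward_metrics([{"reward": 1.0}, {"reward": 0.0}, {"reward": 1.0}])
```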
Checkpointing
Automatic Checkpointing
```yaml
trainer:
  save_freq: 100  # Save every 100 steps
  checkpoint_dir: "checkpoints/"
```

Checkpoints are saved to {checkpoint_dir}/{experiment_name}/step_{global_steps}/
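The layout above can be reproduced with pathlib; a sketch of the stated convention (the `checkpoint_path` helper is illustrative, not an rLLM API):

```python
from pathlib import Path

def checkpoint_path(checkpoint_dir: str, experiment_name: str, global_steps: int) -> Path:
    """Build the {checkpoint_dir}/{experiment_name}/step_{global_steps}/ path."""
    return Path(checkpoint_dir) / experiment_name / f"step_{global_steps}"

path = checkpoint_path("checkpoints/", "math_agent", 100)
```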
Manual Checkpointing
```python
# In a custom training loop
trainer._save_checkpoint()

# Load a checkpoint
trainer._load_checkpoint()
```
Complete Training Example
Here’s a complete example training a math agent:
```python
import hydra
from omegaconf import DictConfig

from rllm.trainer import AgentTrainer
from rllm.data import DatasetRegistry
from rllm.agents import ToolAgent
from rllm.environments import ToolEnvironment
from rllm.rewards import math_reward_fn


@hydra.main(config_path="pkg://rllm.trainer.config", config_name="ppo_trainer", version_base=None)
def main(config: DictConfig):
    # Load datasets
    train_dataset = DatasetRegistry.load_dataset("math", "train")
    val_dataset = DatasetRegistry.load_dataset("math", "test")

    # Agent configuration
    agent_args = {
        "tools": ["python"],
        "parser_name": "qwen",
        "system_prompt": "Solve math problems step by step. Use Python for calculations.",
    }

    # Environment configuration
    env_args = {
        "tools": ["python"],
        "reward_fn": math_reward_fn,
        "max_turns": 5,
    }

    # Create trainer
    trainer = AgentTrainer(
        agent_class=ToolAgent,
        env_class=ToolEnvironment,
        agent_args=agent_args,
        env_args=env_args,
        config=config,
        train_dataset=train_dataset,
        val_dataset=val_dataset,
    )

    # Train
    trainer.train()


if __name__ == "__main__":
    main()
```
Run with:

```bash
python train_math_agent.py \
    data.train_batch_size=256 \
    trainer.total_epochs=10 \
    actor_rollout_ref.model.path="Qwen/Qwen3-4B" \
    algorithm.advantage.kl_ctrl.type="grpo"
```
Best Practices
- Start small: begin with a small batch size and a few epochs to validate your setup before scaling up.
- Monitor early: use W&B or TensorBoard from the start to catch issues early.
- Validate first: always run validation before training (val_before_train: true) to verify your setup.
- Checkpoint often: set save_freq low initially (e.g., 10) to avoid losing progress.
- Watch memory: monitor GPU memory usage, and reduce train_batch_size or max_response_length if you hit OOM errors.
- Tune carefully: RL is sensitive to hyperparameters, so start with the defaults and adjust incrementally.
Next Steps
- RL Algorithms: learn about PPO, GRPO, and other algorithms
- Examples: see complete training examples
- Configuration: detailed configuration reference
- Distributed Training: scale to multiple GPUs/nodes