rLLM is an open-source framework for post-training language agents via reinforcement learning. It provides a modular architecture that makes it easy to build, train, and deploy agentic systems that learn from environmental feedback.
The RL Training Loop
A typical RL system consists of two core components:

- Sampler: Generates trajectories from the current policy (i.e., the agent interacting with environments)
- Trainer: Computes gradients from the sampled trajectories and updates the policy
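The sampler-trainer loop can be pictured with a self-contained toy, in which the "policy" is a single probability parameter rather than a language model and the update is a REINFORCE-style nudge (all names here are illustrative, not rLLM's API):

```python
import random

def sample_trajectories(policy_p, n=64, rng=None):
    """Sampler: roll out the current policy. Here a 'trajectory' is just
    one stochastic action (1 with probability policy_p) and its reward."""
    rng = rng or random.Random(0)
    trajs = []
    for _ in range(n):
        action = 1 if rng.random() < policy_p else 0
        reward = 1.0 if action == 1 else 0.0  # this toy env rewards action 1
        trajs.append((action, reward))
    return trajs

def update_policy(policy_p, trajs, lr=0.05):
    """Trainer: push the policy toward actions with above-average reward."""
    baseline = sum(r for _, r in trajs) / len(trajs)
    for action, reward in trajs:
        advantage = reward - baseline
        direction = 1.0 if action == 1 else -1.0
        policy_p += lr * advantage * direction / len(trajs)
    return min(max(policy_p, 0.01), 0.99)  # keep the parameter in bounds

policy_p = 0.5
for step in range(200):
    trajs = sample_trajectories(policy_p, rng=random.Random(step))
    policy_p = update_policy(policy_p, trajs)
# policy_p has climbed toward the 0.99 clamp
```

In a real system the sampler runs agents against environments and the trainer computes policy gradients over token log-probabilities, but the alternation is the same.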
rLLM’s Modular Architecture
rLLM implements this training loop through several modular components:

1. Agent and Environment Abstractions
BaseAgent and BaseEnv provide simple, extensible interfaces for defining custom agents and environments:
- BaseAgent: Manages state, processes observations, interacts with language models, and tracks trajectories
- BaseEnv: Defines tasks, evaluates actions, provides rewards, and manages episode lifecycles
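The split of responsibilities might look like the following minimal sketch. The method names (`reset`, `step`, `act`) are assumptions about the interface shape, not rLLM's exact signatures:

```python
from abc import ABC, abstractmethod

class BaseEnv(ABC):
    """Defines a task, evaluates actions, and hands out rewards."""
    @abstractmethod
    def reset(self) -> str: ...  # returns the initial observation

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]: ...  # (obs, reward, done)

class BaseAgent(ABC):
    """Processes observations (e.g., by calling an LLM) and tracks the trajectory."""
    def __init__(self):
        self.trajectory = []  # list of (observation, action, reward) tuples

    @abstractmethod
    def act(self, observation: str) -> str: ...

class EchoEnv(BaseEnv):
    """Toy task: reward the agent for repeating the prompt back."""
    def reset(self):
        self.prompt = "hello"
        return self.prompt
    def step(self, action):
        reward = 1.0 if action == self.prompt else 0.0
        return "", reward, True  # single-turn episode

class EchoAgent(BaseAgent):
    def act(self, observation):
        return observation  # stand-in for a language-model call

env, agent = EchoEnv(), EchoAgent()
obs = env.reset()
action = agent.act(obs)
next_obs, reward, done = env.step(action)
agent.trajectory.append((obs, action, reward))
```

The point of the abstraction is that the training loop only ever sees these two interfaces, so custom tasks slot in without touching the trainer.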
2. Execution Engines
rLLM provides two engines for orchestrating agent-environment interactions:

AgentExecutionEngine
A low-level, high-performance engine for simple agent-environment interactions:

- Fully asynchronous and parallel trajectory generation
- Direct agent-environment step-by-step orchestration
- Optimized for single-agent tasks
- Supports both OpenAI API and vLLM backends
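The asynchronous, parallel generation can be pictured with `asyncio`: many rollouts in flight at once, gathered into a batch. This is a toy sketch of the idea, not AgentExecutionEngine's actual interface:

```python
import asyncio
import random

async def rollout(task_id: int) -> dict:
    """One agent-environment episode. The sleep stands in for an
    LLM/API call, which is where concurrency pays off."""
    rng = random.Random(task_id)
    await asyncio.sleep(rng.random() * 0.01)
    return {"task_id": task_id, "reward": float(task_id % 2)}

async def generate_batch(n: int) -> list:
    # All n rollouts run concurrently; gather preserves input order.
    return await asyncio.gather(*(rollout(i) for i in range(n)))

batch = asyncio.run(generate_batch(8))
```

Because each rollout spends most of its time waiting on model inference, running them concurrently rather than sequentially is what makes large sampling batches practical.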
AgentWorkflowEngine
A high-level engine for complex, multi-agent workflows:

- Supports sophisticated multi-agent orchestration
- Workflow-based abstraction for complex reasoning chains
- Episode-level management and metrics
- Built-in retry logic and error handling
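The built-in retry behavior might resemble this episode-level wrapper; the code is illustrative, not AgentWorkflowEngine's actual implementation:

```python
import asyncio

async def run_with_retry(workflow, max_attempts=3, backoff=0.01):
    """Re-run a failed episode a bounded number of times,
    backing off a little longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await workflow()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            await asyncio.sleep(backoff * attempt)

# A flaky workflow that fails twice, then succeeds.
calls = {"n": 0}
async def flaky_episode():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return {"reward": 1.0, "attempts": calls["n"]}

result = asyncio.run(run_with_retry(flaky_episode))
```

Retrying at the episode level matters in practice because long multi-agent rollouts hit transient API errors far more often than single calls do.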
3. Training Infrastructure
AgentTrainer orchestrates the RL training loop:
- Integrates sampler (execution engines) with trainer (verl)
- Supports PPO, GRPO, and other RL algorithms
- Distributed training via Ray
- Simple high-level API for training configuration
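A training configuration for such a trainer might look like the following. Every key and value here is an assumption made for illustration, not rLLM's documented schema:

```python
# Hypothetical AgentTrainer-style configuration; the keys are
# illustrative assumptions, not rLLM's actual config schema.
train_config = {
    "algorithm": "grpo",          # or "ppo", "remax"
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "rollout": {
        "num_envs": 64,           # parallel agent-environment pairs
        "max_steps_per_episode": 8,
    },
    "optim": {
        "lr": 1e-6,
        "train_batch_size": 256,
    },
    "distributed": {
        "backend": "ray",         # Ray handles multi-node orchestration
        "num_gpus": 8,
    },
}
```

The intent of the high-level API is that a configuration of roughly this shape, plus an agent class and an environment class, is all that is needed to launch training.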
Architecture Diagram
Here’s how the components fit together.

Key Data Structures

rLLM uses several core data structures to represent agent interactions:

Step

Represents a single interaction turn.

Trajectory

Represents a sequence of steps for a single agent.

Episode

Represents a complete rollout (potentially multi-agent).

RL Algorithms
rLLM supports multiple RL algorithms optimized for language agent training:

- PPO (Proximal Policy Optimization): Industry-standard policy gradient method with a learned value function (critic)
- GRPO (Group Relative Policy Optimization): Critic-free method that normalizes rewards within a group of rollouts for the same prompt
- ReMax: REINFORCE-style method that uses the reward of a greedy rollout as a variance-reducing baseline
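Tying the data structures to the algorithms: below is a toy sketch of `Step` and `Trajectory` containers together with a GRPO-style group-relative advantage computation. The field names and the exact normalization are illustrative assumptions, not rLLM's schema (and `Episode`, which may bundle several trajectories, is omitted for brevity):

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Step:
    """A single interaction turn."""
    observation: str
    action: str
    reward: float = 0.0

@dataclass
class Trajectory:
    """A sequence of steps for a single agent."""
    steps: list = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)

def grpo_advantages(group: list) -> list:
    """GRPO-style: score each trajectory against its own group
    (several rollouts of the same prompt) instead of a learned critic."""
    rewards = [t.total_reward for t in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

group = [
    Trajectory([Step("q", "a1", 1.0)]),
    Trajectory([Step("q", "a2", 0.0)]),
    Trajectory([Step("q", "a3", 1.0)]),
    Trajectory([Step("q", "a4", 0.0)]),
]
advs = grpo_advantages(group)  # advs == [1.0, -1.0, 1.0, -1.0]
```

Because the baseline comes from the group's own rewards, above-average rollouts get positive advantages and below-average ones negative, with no value network to train.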
Quick Start Example
Here’s a minimal example showing how the components work together.

Design Philosophy

rLLM’s architecture follows these principles:

Modularity: Each component has a clear responsibility and can be used independently or composed together.
Flexibility: The framework supports both simple single-agent tasks and complex multi-agent workflows.
Performance: Built-in asynchronous execution and distributed training for scalability.
Compatibility: Integrates with standard tools (OpenAI API, HuggingFace, Ray, verl).
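As an end-to-end sketch of how these pieces fit together, here is a self-contained toy that mirrors the environment → agent → sampler → trainer flow. Every class is a stand-in for the corresponding rLLM component, not its real API:

```python
# End-to-end toy mirroring the rLLM flow (env -> agent -> sampler -> trainer).
# All names are stand-ins for illustration, not rLLM's actual classes.

class GuessEnv:
    """Single-turn task: reward 1.0 for the correct answer."""
    def reset(self):
        return "What is six times seven?"
    def step(self, action):
        return 1.0 if action == "42" else 0.0

class CandidateAgent:
    """Stand-in policy: tries each candidate answer once per batch."""
    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.best = self.candidates[0]

def sample_batch(env, agent):
    """Sampler: one rollout per candidate action."""
    env.reset()
    return [(action, env.step(action)) for action in agent.candidates]

def train_step(agent, rollouts):
    """Trainer: move the policy toward the highest-reward action."""
    agent.best = max(rollouts, key=lambda pair: pair[1])[0]

env, agent = GuessEnv(), CandidateAgent(["7", "42", "banana"])
train_step(agent, sample_batch(env, agent))  # agent.best is now "42"
```

In real usage the same flow runs with a language-model policy, an execution engine as the sampler, and a gradient-based trainer in place of the argmax.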
Next Steps
Explore each component in detail:

Agents & Environments
Learn how to build custom agents and environments
Execution Engine
Understand trajectory generation and orchestration
Workflow Engine
Build complex multi-agent workflows
Training
Train agents with reinforcement learning