rLLM is an open-source framework for post-training language agents via reinforcement learning. It provides a modular architecture that makes it easy to build, train, and deploy agentic systems that learn from environmental feedback.
The RL Training Loop
A typical RL system consists of two core components:

- Sampler: Generates trajectories from the current policy (i.e., the agent interacting with environments)
- Trainer: Computes gradients from the sampled trajectories and updates the policy
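The sampler-trainer loop can be pictured with a self-contained toy, in which the "policy" is a single probability parameter rather than a language model and the update is a REINFORCE-style nudge (all names here are illustrative, not rLLM's API):

```python
import random

def sample_trajectories(policy_p, n=64, rng=None):
    """Sampler: roll out the current policy. Here a 'trajectory' is just
    one stochastic action (1 with probability policy_p) and its reward."""
    rng = rng or random.Random(0)
    trajs = []
    for _ in range(n):
        action = 1 if rng.random() < policy_p else 0
        reward = 1.0 if action == 1 else 0.0  # this toy env rewards action 1
        trajs.append((action, reward))
    return trajs

def update_policy(policy_p, trajs, lr=0.05):
    """Trainer: push the policy toward actions with above-average reward."""
    baseline = sum(r for _, r in trajs) / len(trajs)
    for action, reward in trajs:
        advantage = reward - baseline
        direction = 1.0 if action == 1 else -1.0
        policy_p += lr * advantage * direction / len(trajs)
    return min(max(policy_p, 0.01), 0.99)  # keep the parameter in bounds

policy_p = 0.5
for step in range(200):
    trajs = sample_trajectories(policy_p, rng=random.Random(step))
    policy_p = update_policy(policy_p, trajs)
# policy_p has climbed toward the 0.99 clamp
```

In a real system the sampler runs agents against environments and the trainer computes policy gradients over token log-probabilities, but the alternation is the same.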
rLLM’s Modular Architecture
rLLM implements this training loop through several modular components:

1. Agent and Environment Abstractions
BaseAgent and BaseEnv provide simple, extensible interfaces for defining custom agents and environments:
- BaseAgent: Manages state, processes observations, interacts with language models, and tracks trajectories
- BaseEnv: Defines tasks, evaluates actions, provides rewards, and manages episode lifecycles
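The split of responsibilities might look like the following minimal sketch. The method names (`reset`, `step`, `act`) are assumptions about the interface shape, not rLLM's exact signatures:

```python
from abc import ABC, abstractmethod

class BaseEnv(ABC):
    """Defines a task, evaluates actions, and hands out rewards."""
    @abstractmethod
    def reset(self) -> str: ...  # returns the initial observation

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]: ...  # (obs, reward, done)

class BaseAgent(ABC):
    """Processes observations (e.g., by calling an LLM) and tracks the trajectory."""
    def __init__(self):
        self.trajectory = []  # list of (observation, action, reward) tuples

    @abstractmethod
    def act(self, observation: str) -> str: ...

class EchoEnv(BaseEnv):
    """Toy task: reward the agent for repeating the prompt back."""
    def reset(self):
        self.prompt = "hello"
        return self.prompt
    def step(self, action):
        reward = 1.0 if action == self.prompt else 0.0
        return "", reward, True  # single-turn episode

class EchoAgent(BaseAgent):
    def act(self, observation):
        return observation  # stand-in for a language-model call

env, agent = EchoEnv(), EchoAgent()
obs = env.reset()
action = agent.act(obs)
next_obs, reward, done = env.step(action)
agent.trajectory.append((obs, action, reward))
```

The point of the abstraction is that the training loop only ever sees these two interfaces, so custom tasks slot in without touching the trainer.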
2. Execution Engines
rLLM provides two engines for orchestrating agent-environment interactions:

AgentExecutionEngine
A low-level, high-performance engine for simple agent-environment interactions:

- Fully asynchronous and parallel trajectory generation
- Direct agent-environment step-by-step orchestration
- Optimized for single-agent tasks
- Supports both OpenAI API and vLLM backends
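The asynchronous, parallel generation can be pictured with `asyncio`: many rollouts in flight at once, gathered into a batch. This is a toy sketch of the idea, not AgentExecutionEngine's actual interface:

```python
import asyncio
import random

async def rollout(task_id: int) -> dict:
    """One agent-environment episode. The sleep stands in for an
    LLM/API call, which is where concurrency pays off."""
    rng = random.Random(task_id)
    await asyncio.sleep(rng.random() * 0.01)
    return {"task_id": task_id, "reward": float(task_id % 2)}

async def generate_batch(n: int) -> list:
    # All n rollouts run concurrently; gather preserves input order.
    return await asyncio.gather(*(rollout(i) for i in range(n)))

batch = asyncio.run(generate_batch(8))
```

Because each rollout spends most of its time waiting on model inference, running them concurrently rather than sequentially is what makes large sampling batches practical.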
AgentWorkflowEngine
A high-level engine for complex, multi-agent workflows:

- Supports sophisticated multi-agent orchestration
- Workflow-based abstraction for complex reasoning chains
- Episode-level management and metrics
- Built-in retry logic and error handling
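The built-in retry behavior might resemble this episode-level wrapper; the code is illustrative, not AgentWorkflowEngine's actual implementation:

```python
import asyncio

async def run_with_retry(workflow, max_attempts=3, backoff=0.01):
    """Re-run a failed episode a bounded number of times,
    backing off a little longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await workflow()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            await asyncio.sleep(backoff * attempt)

# A flaky workflow that fails twice, then succeeds.
calls = {"n": 0}
async def flaky_episode():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return {"reward": 1.0, "attempts": calls["n"]}

result = asyncio.run(run_with_retry(flaky_episode))
```

Retrying at the episode level matters in practice because long multi-agent rollouts hit transient API errors far more often than single calls do.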
3. Training Infrastructure
AgentTrainer orchestrates the RL training loop:
- Integrates sampler (execution engines) with trainer (verl)
- Supports PPO, GRPO, and other RL algorithms
- Distributed training via Ray
- Simple high-level API for training configuration
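A training configuration for such a trainer might look like the following. Every key and value here is an assumption made for illustration, not rLLM's documented schema:

```python
# Hypothetical AgentTrainer-style configuration; the keys are
# illustrative assumptions, not rLLM's actual config schema.
train_config = {
    "algorithm": "grpo",          # or "ppo", "remax"
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "rollout": {
        "num_envs": 64,           # parallel agent-environment pairs
        "max_steps_per_episode": 8,
    },
    "optim": {
        "lr": 1e-6,
        "train_batch_size": 256,
    },
    "distributed": {
        "backend": "ray",         # Ray handles multi-node orchestration
        "num_gpus": 8,
    },
}
```

The intent of the high-level API is that a configuration of roughly this shape, plus an agent class and an environment class, is all that is needed to launch training.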
Architecture Diagram
Here’s how the components fit together.

Key Data Structures

rLLM uses several core data structures to represent agent interactions:

Step

Represents a single interaction turn.

Trajectory

Represents a sequence of steps for a single agent.

Episode

Represents a complete rollout (potentially multi-agent).

RL Algorithms
rLLM supports multiple RL algorithms optimized for language agent training:

- PPO (Proximal Policy Optimization): Industry-standard policy gradient method with a learned value function (critic)
- GRPO (Group Relative Policy Optimization): Critic-free method that normalizes rewards within a group of rollouts for the same prompt
- ReMax: REINFORCE-style method that uses the reward of a greedy rollout as a variance-reducing baseline
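Tying the data structures to the algorithms: below is a toy sketch of `Step` and `Trajectory` containers together with a GRPO-style group-relative advantage computation. The field names and the exact normalization are illustrative assumptions, not rLLM's schema (and `Episode`, which may bundle several trajectories, is omitted for brevity):

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Step:
    """A single interaction turn."""
    observation: str
    action: str
    reward: float = 0.0

@dataclass
class Trajectory:
    """A sequence of steps for a single agent."""
    steps: list = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)

def grpo_advantages(group: list) -> list:
    """GRPO-style: score each trajectory against its own group
    (several rollouts of the same prompt) instead of a learned critic."""
    rewards = [t.total_reward for t in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

group = [
    Trajectory([Step("q", "a1", 1.0)]),
    Trajectory([Step("q", "a2", 0.0)]),
    Trajectory([Step("q", "a3", 1.0)]),
    Trajectory([Step("q", "a4", 0.0)]),
]
advs = grpo_advantages(group)  # advs == [1.0, -1.0, 1.0, -1.0]
```

Because the baseline comes from the group's own rewards, above-average rollouts get positive advantages and below-average ones negative, with no value network to train.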
Quick Start Example
Here’s a minimal example showing how the components work together.

Design Philosophy

rLLM’s architecture follows these principles:

Modularity: Each component has a clear responsibility and can be used independently or composed together.
Flexibility: The framework supports both simple single-agent tasks and complex multi-agent workflows.
Performance: Built-in asynchronous execution and distributed training for scalability.
Compatibility: Integrates with standard tools (OpenAI API, HuggingFace, Ray, verl).
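As an end-to-end sketch of how these pieces fit together, here is a self-contained toy that mirrors the environment → agent → sampler → trainer flow. Every class is a stand-in for the corresponding rLLM component, not its real API:

```python
# End-to-end toy mirroring the rLLM flow (env -> agent -> sampler -> trainer).
# All names are stand-ins for illustration, not rLLM's actual classes.

class GuessEnv:
    """Single-turn task: reward 1.0 for the correct answer."""
    def reset(self):
        return "What is six times seven?"
    def step(self, action):
        return 1.0 if action == "42" else 0.0

class CandidateAgent:
    """Stand-in policy: tries each candidate answer once per batch."""
    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.best = self.candidates[0]

def sample_batch(env, agent):
    """Sampler: one rollout per candidate action."""
    env.reset()
    return [(action, env.step(action)) for action in agent.candidates]

def train_step(agent, rollouts):
    """Trainer: move the policy toward the highest-reward action."""
    agent.best = max(rollouts, key=lambda pair: pair[1])[0]

env, agent = GuessEnv(), CandidateAgent(["7", "42", "banana"])
train_step(agent, sample_batch(env, agent))  # agent.best is now "42"
```

In real usage the same flow runs with a language-model policy, an execution engine as the sampler, and a gradient-based trainer in place of the argmax.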
Next Steps
Explore each component in detail:

Agents & Environments
Learn how to build custom agents and environments
Execution Engine
Understand trajectory generation and orchestration
Workflow Engine
Build complex multi-agent workflows
Training
Train agents with reinforcement learning