This guide walks you through building a math reasoning agent with tool usage capabilities, from dataset preparation through reinforcement learning training.

What you’ll build

In this tutorial, you’ll create an agent that can:
  • Access a Python interpreter to solve mathematical problems
  • Perform step-by-step reasoning with tool calls
  • Learn and improve through reinforcement learning
Example setup: We’ll use Qwen3-4B as the base model, train on the DeepScaleR-Preview-Math dataset, and evaluate on AIME 2024 problems.

Prerequisites

Before starting, ensure you have:
  • rLLM installed with the verl backend (see installation)
  • At least 1 GPU with 16GB+ memory for inference
  • 8+ GPUs for training (or adjust batch sizes for fewer GPUs)

Step 1: Prepare the dataset

rLLM’s DatasetRegistry provides centralized dataset management. Create a script to prepare the math datasets:
examples/math_tool/prepare_math_data.py
from datasets import load_dataset
from rllm.data.dataset import DatasetRegistry

def prepare_math_data():
    train_dataset = load_dataset("agentica-org/DeepScaleR-Preview-Dataset", split="train")
    test_dataset = load_dataset("HuggingFaceH4/aime_2024", split="train")

    def preprocess_fn(example, idx):
        return {
            "question": example["problem"],
            "ground_truth": example["answer"],
            "data_source": "math",
        }

    train_dataset = train_dataset.map(preprocess_fn, with_indices=True)
    test_dataset = test_dataset.map(preprocess_fn, with_indices=True)

    train_dataset = DatasetRegistry.register_dataset("deepscaler_math", train_dataset, "train")
    test_dataset = DatasetRegistry.register_dataset("aime2024", test_dataset, "test")
    return train_dataset, test_dataset

if __name__ == "__main__":
    train_dataset, test_dataset = prepare_math_data()
    print(f"Training examples: {len(train_dataset)}")
    print(f"Test examples: {len(test_dataset)}")
Run the preparation script:
cd examples/math_tool
python prepare_math_data.py
This registers the datasets and stores them as parquet files for efficient loading during training and inference.
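The mapping in `preprocess_fn` simply renames the source fields into the schema rLLM expects. As a standalone illustration with plain dicts (no `datasets` dependency), using the same field names as the script above:

```python
# Standalone sketch of the field mapping performed by preprocess_fn above.
# Input keys ("problem", "answer") match the HuggingFace datasets;
# output keys ("question", "ground_truth", "data_source") are the rLLM schema.

def preprocess_fn(example, idx):
    return {
        "question": example["problem"],
        "ground_truth": example["answer"],
        "data_source": "math",
    }

raw = {"problem": "What is 2 + 2?", "answer": "4"}
processed = preprocess_fn(raw, 0)
print(processed)
# {'question': 'What is 2 + 2?', 'ground_truth': '4', 'data_source': 'math'}
```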

Step 2: Start the model server

rLLM requires a model server for inference. You can use either vLLM or SGLang; for example, to serve the model with vLLM:
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-4B \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16 \
    --tensor-parallel-size 1
For multi-GPU inference, increase --tensor-parallel-size (e.g., 4 for 4 GPUs).
The server provides an OpenAI-compatible API at http://localhost:30000/v1.
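Once the server is up, you can sanity-check it with any OpenAI-compatible client. A minimal sketch using only the Python standard library (the URL and model name match the launch command above; the request simply fails with a printed message if the server is not reachable):

```python
import json
import urllib.error
import urllib.request

# Endpoint and model name from the vLLM launch command above.
BASE_URL = "http://localhost:30000/v1"
payload = {
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "max_tokens": 256,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
        print(reply)
except urllib.error.URLError as e:
    print(f"Server not reachable yet: {e}")
```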

Step 3: Run inference

Test your agent’s math problem-solving capability before training:
examples/math_tool/run_math_with_tool.py
from rllm.agents.tool_agent import ToolAgent
from rllm.engine import AgentExecutionEngine
from rllm.environments.tools import ToolEnvironment
from rllm.data.dataset import DatasetRegistry

# Load test dataset
test_dataset = DatasetRegistry.load_dataset("aime2024", split="test")

# Configure agent with Python tool access
agent = ToolAgent(
    model_name="Qwen/Qwen3-4B",
    base_url="http://localhost:30000/v1",
    tools=["python"],
    max_steps=10
)

# Create tool execution environment
env = ToolEnvironment(tools=["python"])

# Initialize execution engine
engine = AgentExecutionEngine(
    agent=agent,
    env=env,
    n_parallel_agents=64  # Parallel agent-environment pairs
)

# Run inference on test set
results = engine.execute_tasks(test_dataset)

# Compute metrics
pass_at_1 = sum(r["correct"] for r in results) / len(results)
print(f"Pass@1 on AIME 2024: {pass_at_1:.2%}")
Run the inference script:
cd examples/math_tool
python run_math_with_tool.py
The AgentExecutionEngine orchestrates parallel agent-environment interactions, processing all test problems and computing accuracy metrics.
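The pass@1 metric computed above is simply the fraction of problems solved on a single attempt. A self-contained sketch with mock results (the `correct` field mirrors the boolean in the result dicts the engine returns):

```python
# Mock results in the same shape used above: one dict per problem,
# with a boolean "correct" field from answer verification.
results = [
    {"task_id": 0, "correct": True},
    {"task_id": 1, "correct": False},
    {"task_id": 2, "correct": True},
    {"task_id": 3, "correct": True},
]

# Pass@1 = solved problems / total problems (booleans sum as 0/1 in Python).
pass_at_1 = sum(r["correct"] for r in results) / len(results)
print(f"Pass@1: {pass_at_1:.2%}")  # Pass@1: 75.00%
```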

Step 4: Train with reinforcement learning

Improve your agent’s performance through RL training with GRPO (Group Relative Policy Optimization):
examples/math_tool/train_math_with_tool.py
from rllm.trainer import AgentTrainer

# Initialize trainer with GRPO
trainer = AgentTrainer(
    agent=agent,
    env=env,
    train_dataset="deepscaler_math",
    test_dataset="aime2024",
    algorithm="grpo",
    backend="verl",
    # Training hyperparameters
    num_iterations=100,
    batch_size=512,
    rollout_batch_size=64,
    n_parallel_agents=64,
    # GRPO-specific settings
    gamma=1.0,  # Discount factor
    group_size=8,  # Samples per problem for advantage estimation
)

# Launch training
trainer.train()
Run the training script:
cd examples/math_tool
python train_math_with_tool.py
Training proceeds as an iterative loop with four stages:

  1. Rollout generation: The AgentExecutionEngine generates trajectories by running agent-environment pairs in parallel on training data batches.
  2. Advantage calculation: GRPO computes advantages by comparing each trajectory's reward to the group average, encouraging high-reward actions.
  3. Model update: The training backend updates model parameters to increase the probability of successful actions using PPO-style optimization.
  4. Iteration: The updated model generates new trajectories for the next batch, continuing the improvement cycle.
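Stage 2 of the loop can be sketched numerically. Here is a minimal, library-free illustration of group-relative advantages, mean-centered and scaled by the group standard deviation as in the common GRPO formulation (the exact normalization used by the verl backend may differ):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center each reward on the group mean
    and scale by the group standard deviation (eps avoids division by zero)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# group_size=8 rollouts of the same problem, scored 1.0 (correct) / 0.0 (wrong).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
# Correct rollouts get positive advantage, incorrect ones negative;
# the group's advantages sum to (approximately) zero.
```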

Key components

Component            | Purpose                       | Example usage
ToolAgent            | Agent with tool capabilities  | Reasoning + Python execution
ToolEnvironment      | Safe tool execution           | Sandboxed Python interpreter
DatasetRegistry      | Dataset management            | Load/register datasets
AgentExecutionEngine | Parallel agent execution      | Efficient batch inference
AgentTrainer         | RL training orchestration     | GRPO/PPO-based training

Next steps

  • Core concepts: Learn about rLLM's architecture and key abstractions
  • Building agents: Create custom agents for your domain
  • Custom environments: Design environments and reward functions
  • More examples: Explore additional examples and use cases
