This guide walks you through building a math reasoning agent with tool usage capabilities, from dataset preparation through reinforcement learning training.

What you’ll build

In this tutorial, you’ll create an agent that can:
  • Access a Python interpreter to solve mathematical problems
  • Perform step-by-step reasoning with tool calls
  • Learn and improve through reinforcement learning
Example setup: We’ll use Qwen3-4B as the base model, train on the DeepScaleR-Preview-Math dataset, and evaluate on AIME 2024 problems.

Prerequisites

Before starting, ensure you have:
  • rLLM installed with the verl backend (see installation)
  • At least 1 GPU with 16GB+ memory for inference
  • 8+ GPUs for training (or adjust batch sizes for fewer GPUs)

Step 1: Prepare the dataset

rLLM’s DatasetRegistry provides centralized dataset management. Create a script to prepare the math datasets:
examples/math_tool/prepare_math_data.py
from datasets import load_dataset
from rllm.data.dataset import DatasetRegistry

def prepare_math_data():
    train_dataset = load_dataset("agentica-org/DeepScaleR-Preview-Dataset", split="train")
    test_dataset = load_dataset("HuggingFaceH4/aime_2024", split="train")

    def preprocess_fn(example, idx):
        return {
            "question": example["problem"],
            "ground_truth": example["answer"],
            "data_source": "math",
        }

    train_dataset = train_dataset.map(preprocess_fn, with_indices=True)
    test_dataset = test_dataset.map(preprocess_fn, with_indices=True)

    train_dataset = DatasetRegistry.register_dataset("deepscaler_math", train_dataset, "train")
    test_dataset = DatasetRegistry.register_dataset("aime2024", test_dataset, "test")
    return train_dataset, test_dataset

if __name__ == "__main__":
    train_dataset, test_dataset = prepare_math_data()
    print(f"Training examples: {len(train_dataset)}")
    print(f"Test examples: {len(test_dataset)}")
Run the preparation script:
cd examples/math_tool
python prepare_math_data.py
This registers the datasets and stores them as parquet files for efficient loading during training and inference.
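The mapping in `preprocess_fn` simply renames the source fields into the schema rLLM expects. As a standalone illustration with plain dicts (no `datasets` dependency), using the same field names as the script above:

```python
# Standalone sketch of the field mapping performed by preprocess_fn above.
# Input keys ("problem", "answer") match the HuggingFace datasets;
# output keys ("question", "ground_truth", "data_source") are the rLLM schema.

def preprocess_fn(example, idx):
    return {
        "question": example["problem"],
        "ground_truth": example["answer"],
        "data_source": "math",
    }

raw = {"problem": "What is 2 + 2?", "answer": "4"}
processed = preprocess_fn(raw, 0)
print(processed)
# {'question': 'What is 2 + 2?', 'ground_truth': '4', 'data_source': 'math'}
```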

Step 2: Start the model server

rLLM requires a model server for inference. You can use either vLLM or SGLang; for example, to serve the model with vLLM:
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-4B \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16 \
    --tensor-parallel-size 1
For multi-GPU inference, increase --tensor-parallel-size (e.g., 4 for 4 GPUs).
The server provides an OpenAI-compatible API at http://localhost:30000/v1.
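Once the server is up, you can sanity-check it with any OpenAI-compatible client. A minimal sketch using only the Python standard library (the URL and model name match the launch command above; the request simply fails with a printed message if the server is not reachable):

```python
import json
import urllib.error
import urllib.request

# Endpoint and model name from the vLLM launch command above.
BASE_URL = "http://localhost:30000/v1"
payload = {
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "max_tokens": 256,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
        print(reply)
except urllib.error.URLError as e:
    print(f"Server not reachable yet: {e}")
```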

Step 3: Run inference

Test your agent’s math problem-solving capability before training:
examples/math_tool/run_math_with_tool.py
from rllm.agents.tool_agent import ToolAgent
from rllm.engine import AgentExecutionEngine
from rllm.environments.tools import ToolEnvironment
from rllm.data.dataset import DatasetRegistry

# Load test dataset
test_dataset = DatasetRegistry.load_dataset("aime2024", split="test")

# Configure agent with Python tool access
agent = ToolAgent(
    model_name="Qwen/Qwen3-4B",
    base_url="http://localhost:30000/v1",
    tools=["python"],
    max_steps=10
)

# Create tool execution environment
env = ToolEnvironment(tools=["python"])

# Initialize execution engine
engine = AgentExecutionEngine(
    agent=agent,
    env=env,
    n_parallel_agents=64  # Parallel agent-environment pairs
)

# Run inference on test set
results = engine.execute_tasks(test_dataset)

# Compute metrics
pass_at_1 = sum(r["correct"] for r in results) / len(results)
print(f"Pass@1 on AIME 2024: {pass_at_1:.2%}")
Run the inference script:
cd examples/math_tool
python run_math_with_tool.py
The AgentExecutionEngine orchestrates parallel agent-environment interactions, processing all test problems and computing accuracy metrics.
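The pass@1 metric computed above is simply the fraction of problems solved on a single attempt. A self-contained sketch with mock results (the `correct` field mirrors the boolean in the result dicts the engine returns):

```python
# Mock results in the same shape used above: one dict per problem,
# with a boolean "correct" field from answer verification.
results = [
    {"task_id": 0, "correct": True},
    {"task_id": 1, "correct": False},
    {"task_id": 2, "correct": True},
    {"task_id": 3, "correct": True},
]

# Pass@1 = solved problems / total problems (booleans sum as 0/1 in Python).
pass_at_1 = sum(r["correct"] for r in results) / len(results)
print(f"Pass@1: {pass_at_1:.2%}")  # Pass@1: 75.00%
```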

Step 4: Train with reinforcement learning

Improve your agent’s performance through RL training with GRPO (Group Relative Policy Optimization):
examples/math_tool/train_math_with_tool.py
from rllm.trainer import AgentTrainer

# Initialize trainer with GRPO
trainer = AgentTrainer(
    agent=agent,
    env=env,
    train_dataset="deepscaler_math",
    test_dataset="aime2024",
    algorithm="grpo",
    backend="verl",
    # Training hyperparameters
    num_iterations=100,
    batch_size=512,
    rollout_batch_size=64,
    n_parallel_agents=64,
    # GRPO-specific settings
    gamma=1.0,  # Discount factor
    group_size=8,  # Samples per problem for advantage estimation
)

# Launch training
trainer.train()
Run the training script:
cd examples/math_tool
python train_math_with_tool.py
Training proceeds as an iterative loop with four stages:

  1. Rollout generation: The AgentExecutionEngine generates trajectories by running agent-environment pairs in parallel on training data batches.
  2. Advantage calculation: GRPO computes advantages by comparing each trajectory's reward to the group average, encouraging high-reward actions.
  3. Model update: The training backend updates model parameters to increase the probability of successful actions using PPO-style optimization.
  4. Iteration: The updated model generates new trajectories for the next batch, continuing the improvement cycle.
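Stage 2 of the loop can be sketched numerically. Here is a minimal, library-free illustration of group-relative advantages, mean-centered and scaled by the group standard deviation as in the common GRPO formulation (the exact normalization used by the verl backend may differ):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center each reward on the group mean
    and scale by the group standard deviation (eps avoids division by zero)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# group_size=8 rollouts of the same problem, scored 1.0 (correct) / 0.0 (wrong).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
# Correct rollouts get positive advantage, incorrect ones negative;
# the group's advantages sum to (approximately) zero.
```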

Key components

Component            | Purpose                       | Example usage
ToolAgent            | Agent with tool capabilities  | Reasoning + Python execution
ToolEnvironment      | Safe tool execution           | Sandboxed Python interpreter
DatasetRegistry      | Dataset management            | Load/register datasets
AgentExecutionEngine | Parallel agent execution      | Efficient batch inference
AgentTrainer         | RL training orchestration     | GRPO/PPO-based training

Next steps

  • Core concepts: Learn about rLLM's architecture and key abstractions
  • Building agents: Create custom agents for your domain
  • Custom environments: Design environments and reward functions
  • More examples: Explore additional examples and use cases
