This guide walks you through building a math reasoning agent with tool usage capabilities, from dataset preparation through reinforcement learning training.
## What you’ll build

In this tutorial, you’ll create an agent that can:

- Access a Python interpreter to solve mathematical problems
- Perform step-by-step reasoning with tool calls
- Learn and improve through reinforcement learning

**Example setup:** We’ll use Qwen3-4B as the base model, train on the DeepScaleR-Preview dataset, and evaluate on AIME 2024 problems.
## Prerequisites

Before starting, ensure you have:

- rLLM installed with the verl backend (see installation)
- At least 1 GPU with 16GB+ memory for inference
- 8+ GPUs for training (or adjust batch sizes for fewer GPUs)
## Step 1: Prepare the dataset

rLLM’s `DatasetRegistry` provides centralized dataset management. Create a script to prepare the math datasets:

`examples/math_tool/prepare_math_data.py`:
```python
from datasets import load_dataset

from rllm.data.dataset import DatasetRegistry


def prepare_math_data():
    train_dataset = load_dataset("agentica-org/DeepScaleR-Preview-Dataset", split="train")
    test_dataset = load_dataset("HuggingFaceH4/aime_2024", split="train")

    def preprocess_fn(example, idx):
        # Map the raw columns to the schema rLLM expects
        return {
            "question": example["problem"],
            "ground_truth": example["answer"],
            "data_source": "math",
        }

    train_dataset = train_dataset.map(preprocess_fn, with_indices=True)
    test_dataset = test_dataset.map(preprocess_fn, with_indices=True)

    train_dataset = DatasetRegistry.register_dataset("deepscaler_math", train_dataset, "train")
    test_dataset = DatasetRegistry.register_dataset("aime2024", test_dataset, "test")
    return train_dataset, test_dataset


if __name__ == "__main__":
    train_dataset, test_dataset = prepare_math_data()
    print(f"Training examples: {len(train_dataset)}")
    print(f"Test examples: {len(test_dataset)}")
```
Run the preparation script:

```bash
cd examples/math_tool
python prepare_math_data.py
```
This registers the datasets and stores them as parquet files for efficient loading during training and inference.
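To sanity-check the field mapping before downloading anything, you can apply the same `preprocess_fn` to a hand-written example using plain dicts (the function body is copied from the script above):

```python
def preprocess_fn(example, idx):
    # Same field mapping used in prepare_math_data.py:
    # raw columns (problem/answer) -> rLLM schema (question/ground_truth)
    return {
        "question": example["problem"],
        "ground_truth": example["answer"],
        "data_source": "math",
    }

raw = {"problem": "What is 7 * 6?", "answer": "42"}
row = preprocess_fn(raw, 0)
print(row["question"])      # What is 7 * 6?
print(row["ground_truth"])  # 42
```

Every row in the registered datasets follows this three-key schema, which is what the reward function later uses to check answers against `ground_truth`.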
## Step 2: Start the model server

rLLM requires a model server for inference. Choose either vLLM or SGLang.

**vLLM:**

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype bfloat16 \
  --tensor-parallel-size 1
```
For multi-GPU inference, increase `--tensor-parallel-size` (e.g., 4 for 4 GPUs).

**SGLang:**

```bash
python -m sglang_router.launch_server \
  --model-path Qwen/Qwen3-4B \
  --dp-size 1 \
  --dtype bfloat16
```

For data-parallel processing on multiple GPUs, increase `--dp-size`.

Either server exposes an OpenAI-compatible API at `http://localhost:30000/v1`.
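Before moving on, it’s worth a quick check that the server is actually up. Any OpenAI-compatible endpoint answers `GET /v1/models`, so (assuming the server was started on port 30000 as above):

```shell
# Should return a JSON object whose "data" list includes Qwen/Qwen3-4B
curl http://localhost:30000/v1/models
```

If this hangs or errors, the server has not finished loading the model yet; wait for the startup logs to settle before running inference.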
## Step 3: Run inference

Test your agent’s math problem-solving capability before training:

`examples/math_tool/run_math_with_tool.py`:
```python
from rllm.agents.tool_agent import ToolAgent
from rllm.data.dataset import DatasetRegistry
from rllm.engine import AgentExecutionEngine
from rllm.environments.tools import ToolEnvironment

# Load the registered test dataset
test_dataset = DatasetRegistry.load_dataset("aime2024", split="test")

# Configure the agent with Python tool access
agent = ToolAgent(
    model_name="Qwen/Qwen3-4B",
    base_url="http://localhost:30000/v1",
    tools=["python"],
    max_steps=10,
)

# Create the tool execution environment
env = ToolEnvironment(tools=["python"])

# Initialize the execution engine
engine = AgentExecutionEngine(
    agent=agent,
    env=env,
    n_parallel_agents=64,  # Parallel agent-environment pairs
)

# Run inference on the test set
results = engine.execute_tasks(test_dataset)

# Compute metrics
pass_at_1 = sum(r["correct"] for r in results) / len(results)
print(f"Pass@1 on AIME 2024: {pass_at_1:.2%}")
```
Run the inference script:

```bash
cd examples/math_tool
python run_math_with_tool.py
```
The `AgentExecutionEngine` orchestrates parallel agent-environment interactions, processing all test problems and computing accuracy metrics.
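Pass@1 above averages a single attempt per problem. If you instead sample several attempts per problem (as GRPO’s `group_size` does during training), the standard unbiased pass@k estimator from Chen et al. (2021) applies. A minimal sketch, independent of rLLM:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: probability that at least one of k samples, drawn
    # without replacement from n attempts (c of them correct), is correct.
    # Equals 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0  # fewer than k incorrect attempts: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 8 attempts at one problem, 2 of them correct:
print(pass_at_k(8, 2, 1))  # 0.25
print(pass_at_k(8, 2, 4))  # ~0.786
```

This estimator is lower-variance than simply checking whether any of the first k attempts succeeded, which is why it is the conventional way to report pass@k.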
## Step 4: Train with reinforcement learning

Improve your agent’s performance through RL training with GRPO (Group Relative Policy Optimization):

`examples/math_tool/train_math_with_tool.py`:
```python
from rllm.agents.tool_agent import ToolAgent
from rllm.environments.tools import ToolEnvironment
from rllm.trainer import AgentTrainer

# Agent and environment, configured as in Step 3
agent = ToolAgent(
    model_name="Qwen/Qwen3-4B",
    base_url="http://localhost:30000/v1",
    tools=["python"],
    max_steps=10,
)
env = ToolEnvironment(tools=["python"])

# Initialize the trainer with GRPO
trainer = AgentTrainer(
    agent=agent,
    env=env,
    train_dataset="deepscaler_math",
    test_dataset="aime2024",
    algorithm="grpo",
    backend="verl",
    # Training hyperparameters
    num_iterations=100,
    batch_size=512,
    rollout_batch_size=64,
    n_parallel_agents=64,
    # GRPO-specific settings
    gamma=1.0,     # Discount factor
    group_size=8,  # Samples per problem for advantage estimation
)

# Launch training
trainer.train()
```
Run the training script:

```bash
cd examples/math_tool
bash train_math_with_tool.sh
```
Each training iteration proceeds through four phases:

1. **Rollout generation:** The `AgentExecutionEngine` generates trajectories by running agent-environment pairs in parallel on training data batches.
2. **Advantage calculation:** GRPO computes advantages by comparing each trajectory’s reward to the group average, encouraging high-reward actions.
3. **Model update:** The training backend updates model parameters to increase the probability of successful actions using PPO-style optimization.
4. **Iteration:** The updated model generates new trajectories for the next batch, continuing the improvement cycle.
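The advantage-calculation phase can be sketched in a few lines. This is a simplified illustration of group-relative normalization, not the verl backend’s implementation: with `group_size=8`, each problem gets 8 rollouts whose rewards are normalized against their own group.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    # Group-relative advantage: center each rollout's reward on the group
    # mean and scale by the group std (eps avoids division by zero when
    # every reward in the group is identical).
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# One problem, group_size = 8 rollouts, reward 1 if the answer was correct:
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = grpo_advantages(rewards)
# Correct rollouts get positive advantage, incorrect ones negative,
# and the advantages sum to (approximately) zero.
```

Because advantages are relative within a group, a problem where every rollout fails (or every rollout succeeds) contributes no learning signal, which is why sampling multiple rollouts per problem matters.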
## Key components

| Component | Purpose | Example usage |
| --- | --- | --- |
| `ToolAgent` | Agent with tool capabilities | Reasoning + Python execution |
| `ToolEnvironment` | Safe tool execution | Sandboxed Python interpreter |
| `DatasetRegistry` | Dataset management | Load/register datasets |
| `AgentExecutionEngine` | Parallel agent execution | Efficient batch inference |
| `AgentTrainer` | RL training orchestration | GRPO/PPO-based training |
## Next steps

- **Core concepts:** Learn about rLLM’s architecture and key abstractions
- **Building agents:** Create custom agents for your domain
- **Custom environments:** Design environments and reward functions
- **More examples:** Explore additional examples and use cases