In this tutorial, you’ll build a two-agent system where:
  • Solver: Generates candidate solutions to a problem
  • Judge: Evaluates and selects the best solution
This pattern is powerful for training agents that can both generate and verify solutions.
[Figure: training curve]

Overview

By the end of this tutorial, you will have:
  1. Built a Solver agent that generates multiple solution candidates
  2. Built a Judge agent that selects the best solution
  3. Assigned separate rewards to each agent using @trajectory
  4. Trained the multi-agent system end-to-end
Dataset: Countdown - Given numbers, reach a target using arithmetic operations.

Why Multi-Agent?

In a multi-agent system, you have multiple rollout functions (here, the Solver and the Judge), each producing its own trajectory with its own reward, so both agents can be trained from the same rollout.

Concepts

We will cover:
  • @trajectory decorator: Automatic session management and trace capture
  • TrajectoryView: Access to steps, results, and rewards
  • Multi-agent workflows: Composing multiple agents with independent rewards

Setup

Step 1: Install dependencies

Install rLLM if you haven’t already:
pip install rllm
Step 2: Prepare the dataset

Download the Countdown dataset:
python -m rllm.data.prepare_countdown
Step 3: Launch a vLLM server

Start a vLLM server for testing:
vllm serve Qwen/Qwen3-4B-Instruct-2507 \
    --host 0.0.0.0 \
    --port 4000
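Before moving on, you can verify the server is reachable. A minimal check, assuming the standard `/v1/models` endpoint that vLLM's OpenAI-compatible server exposes:

```python
import json
import urllib.request


def server_is_up(base_url: str = "http://localhost:4000/v1") -> bool:
    """Return True if the vLLM server responds and lists at least one model."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            data = json.load(resp)
        return len(data.get("data", [])) > 0
    except (OSError, ValueError):
        # Connection refused, timeout, or malformed response
        return False


print(server_is_up())
```

If this prints False, double-check the host and port passed to `vllm serve` before continuing.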

1. Understanding @trajectory

The @trajectory decorator automatically:
  • Tracks all LLM calls as steps
  • Returns a TrajectoryView with steps and result

1.1 Basic usage

from rllm.sdk import trajectory, get_chat_client_async

@trajectory(name="my_agent")
async def my_agent(prompt: str):
    client = get_chat_client_async(
        base_url="http://localhost:4000/v1", 
        api_key="EMPTY", 
        use_proxy=False  # set to False when using vLLM server directly
    )
    response = await client.chat.completions.create(
        model="Qwen/Qwen3-4B-Instruct-2507",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

1.2 What you get back

traj = await my_agent("What is 2+2?")

# traj is a TrajectoryView with:
print("Agent Name:", traj.name)     # "my_agent"
print("Response:", traj.result)      # "4" (your return value)
print("Steps:", traj.steps)          # [StepView(...)] - one per LLM call
print("Reward:", traj.reward)        # 0.0 (default, you can set this)

2. Countdown Task

Given a target number and a list of numbers, write an equation that uses each of the given numbers exactly once to reach the target. Example:
  • Target: 150
  • Numbers: [3, 50]
  • Valid solution: 3 * 50 = 150
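To make the rules concrete, here is a small illustrative checker (not rLLM's actual reward code) that tests whether an equation string is a valid Countdown solution:

```python
import re


def is_valid_countdown(equation: str, numbers: list[int], target: int) -> bool:
    """Check that the left-hand side uses exactly `numbers` and equals `target`."""
    lhs = equation.split("=")[0]
    # Only digits, whitespace, parentheses, and + - * / are allowed
    if not re.fullmatch(r"[\d\s+\-*/().]+", lhs):
        return False
    # Each given number must be used exactly once
    if sorted(int(n) for n in re.findall(r"\d+", lhs)) != sorted(numbers):
        return False
    try:
        # eval is safe here: the character set was validated above
        return abs(eval(lhs) - target) < 1e-6
    except (SyntaxError, ZeroDivisionError):
        return False


print(is_valid_countdown("3 * 50 = 150", [3, 50], 150))   # valid
print(is_valid_countdown("100 + 50 = 150", [3, 50], 150)) # 100 is not a given number
```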

3. Build the Solver Agent

The Solver generates solution candidates for Countdown puzzles.

3.1 Define the Solver class

import asyncio
import re
from rllm.sdk import trajectory, get_chat_client_async

SOLVER_PROMPT = "{problem}. Output the final answer within <answer>...</answer>"

class Solver:
    def __init__(self, use_proxy: bool = False):
        self.client = get_chat_client_async(
            base_url="http://localhost:4000/v1", 
            api_key="token-abc123",
            use_proxy=use_proxy,
        )
        self.model = "Qwen/Qwen3-4B-Instruct-2507"

    @trajectory(name="solver")
    async def generate_solution(self, problem: str):
        """Generate a single solution. Returns TrajectoryView automatically."""
        prompt = SOLVER_PROMPT.format(problem=problem)
        
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # Higher temperature for diverse solutions
            max_tokens=1000,
        )
        
        response_text = response.choices[0].message.content
        return self._parse_answer(response_text)

    def _parse_answer(self, response: str) -> str:
        """Extract answer from <answer>...</answer> tags."""
        match = re.search(r"<answer>(.*?)</answer>", response, re.IGNORECASE | re.DOTALL)
        if match:
            return f"<answer>{match.group(1).strip()}</answer>"
        return ""

    async def generate_solutions(self, problem: str, n_solutions: int = 2):
        """Generate multiple solutions concurrently."""
        tasks = [
            asyncio.create_task(self.generate_solution(problem))
            for _ in range(n_solutions)
        ]
        return await asyncio.gather(*tasks)
Why <answer> tags? The reward function extracts the solution from <answer>equation</answer>. Without the tags, it cannot find your answer, much like a missing \boxed{} in math problems.
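You can sanity-check the tag extraction without a running server. This standalone sketch uses the same regex as `_parse_answer` above:

```python
import re


def parse_answer(response: str) -> str:
    """Extract the <answer>...</answer> span from a model response."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.IGNORECASE | re.DOTALL)
    if match:
        return f"<answer>{match.group(1).strip()}</answer>"
    return ""  # reward will be 0.0 if the tags are missing


print(parse_answer("The equation is <answer> 3 * 50 = 150 </answer>."))
print(repr(parse_answer("no tags here")))  # empty string
```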

3.2 Test the Solver

solver = Solver()

# Generate 2 solutions for a Countdown puzzle
problem = "Using numbers [3, 50], reach target 150"
trajs = await solver.generate_solutions(problem, n_solutions=2)

for i, traj in enumerate(trajs):
    print(f"Solution {i+1}: {traj.result}")
    print(f"Collected LLM Calls: {len(traj.steps)}")
Expected output:
Solution 1: <answer>3 * 50 = 150</answer>
Collected LLM Calls: 1
Solution 2: <answer>3 * 50</answer>
Collected LLM Calls: 1

4. Build the Judge Agent

The Judge evaluates solutions and selects the best one.

4.1 Define the Judge class

JUDGE_PROMPT = """You are an expert verifier. Given a countdown problem and multiple solution attempts, select a correct solution.
Problem:
{problem}
Solutions to evaluate:

{solutions}

A correct solution must:
1. Use only the given numbers
2. Use each number exactly once
3. Use only basic arithmetic operations (+, -, *, /)
4. Result in the target number
5. Be marked within <answer>...</answer> tags

Output the index of your selected solution within <answer>...</answer> tags, e.g., <answer>1</answer> for the first solution."""

class Judge:
    def __init__(self, use_proxy: bool = False):
        self.client = get_chat_client_async(
            base_url="http://localhost:4000/v1", 
            api_key="token-abc123",
            use_proxy=use_proxy,
        )
        self.model = "Qwen/Qwen3-4B-Instruct-2507"

    @trajectory(name="judge")
    async def judge_solutions(self, problem: str, solutions: list[str]):
        """Evaluate solutions and select the best one."""
        # Format solutions list
        solutions_text = ""
        for i, sol in enumerate(solutions, 1):
            solutions_text += f"\nSolution {i}:\n{sol}\n"
        
        prompt = JUDGE_PROMPT.format(problem=problem, solutions=solutions_text)
        
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
            max_tokens=2000,
        )
        
        response_text = response.choices[0].message.content
        return self._parse_selection(response_text, solutions)

    def _parse_selection(self, response: str, solutions: list[str]) -> str:
        """Extract selected solution index."""
        match = re.search(r"<answer>(\d+)</answer>", response)
        if match:
            idx = int(match.group(1)) - 1
            if 0 <= idx < len(solutions):
                return solutions[idx]
        return ""
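As with the Solver, the selection parser can be checked offline. A quick sketch using the same regex and index handling as `_parse_selection`:

```python
import re


def parse_selection(response: str, solutions: list[str]) -> str:
    """Return the solution named by the judge's 1-based <answer>N</answer> index."""
    match = re.search(r"<answer>(\d+)</answer>", response)
    if match:
        idx = int(match.group(1)) - 1  # prompt uses 1-based indices
        if 0 <= idx < len(solutions):
            return solutions[idx]
    return ""  # malformed or out-of-range selection


candidates = ["<answer>100 + 50 = 150</answer>", "<answer>3 * 50 = 150</answer>"]
print(parse_selection("I pick <answer>2</answer>", candidates))
print(repr(parse_selection("<answer>9</answer>", candidates)))  # out of range
```

Note that an out-of-range or missing index yields an empty string, which the reward function will score as 0.0, giving the Judge a direct incentive to produce well-formed selections.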

5. Compose the Workflow

Now combine Solver and Judge, assigning rewards to each trajectory.
from rllm.sdk import TrajectoryView
from rllm.rewards.countdown_reward import countdown_reward_fn

class SolverJudgeWorkflow:
    def __init__(self, n_solutions: int = 2, **kwargs):
        self.n_solutions = n_solutions
        self.reward_function = countdown_reward_fn
        self.solver = Solver(use_proxy=True)
        self.judge = Judge(use_proxy=True)

    async def run(self, task: dict, **kwargs) -> list[TrajectoryView]:
        """Run the full workflow and return all trajectories."""
        problem = task["question"]

        # Step 1: Generate multiple solutions
        solver_trajs = await self.solver.generate_solutions(problem, self.n_solutions)

        # Step 2: Assign rewards to each solver
        solutions = []
        for traj in solver_trajs:
            parsed_answer = traj.result
            reward = self.reward_function(task, parsed_answer).reward
            
            # Assign reward to the trajectory AND its steps
            traj.steps[0].reward = reward
            traj.reward = reward
            solutions.append(parsed_answer)

        # Step 3: Judge selects the best solution
        judge_traj = await self.judge.judge_solutions(problem, solutions)
        selected = judge_traj.result
        
        # Judge reward based on final selection quality
        judge_reward = self.reward_function(task, selected).reward
        judge_traj.steps[0].reward = judge_reward
        judge_traj.reward = judge_reward

        # Return ALL trajectories for training
        return solver_trajs + [judge_traj]

5.1 Reward assignment strategy

Example run:
┌─────────────────────────────────────────────────┐
│ Problem: Reach 150 with [3, 50]                 │
├─────────────────────────────────────────────────┤
│ Solver 1: "100 + 50 = 150"  → reward = 0.0 ✗    │
│ Solver 2: "3 * 50 = 150"    → reward = 1.0 ✓    │
│ Judge: selects Solver 2     → reward = 1.0 ✓    │
└─────────────────────────────────────────────────┘

Training signal:
• Solver 2 is reinforced (correct answer)
• Solver 1 is penalized (invalid answer: 100 is not among the given numbers)
• Judge is reinforced for selecting the correct solution
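The mapping from the table to trajectory rewards can be sketched directly. Here `Traj` is a toy stand-in for rLLM's `TrajectoryView`, and the rewards are the ones from the example run above:

```python
from dataclasses import dataclass


@dataclass
class Traj:  # illustrative stand-in for rllm's TrajectoryView
    name: str
    result: str
    reward: float = 0.0


solvers = [
    Traj("solver", "<answer>100 + 50 = 150</answer>"),
    Traj("solver", "<answer>3 * 50 = 150</answer>"),
]
per_solver = [0.0, 1.0]  # would come from reward_function(task, traj.result).reward
for traj, r in zip(solvers, per_solver):
    traj.reward = r

# The Judge selected Solver 2, so it inherits that answer's reward
judge = Traj("judge", solvers[1].result, reward=per_solver[1])

all_trajs = solvers + [judge]  # returned together for training
print([t.reward for t in all_trajs])
```

Each trajectory carries its own reward, so the trainer can update Solver and Judge policies from the same episode without any shared credit assignment.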

6. Run Training

import hydra
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_trainer import AgentTrainer

async def run_workflow(**kwargs) -> list[TrajectoryView]:
    """Training wrapper that returns trajectories."""
    workflow = SolverJudgeWorkflow(n_solutions=2)
    return await workflow.run(kwargs)

@hydra.main(
    config_path="pkg://rllm.trainer.config", 
    config_name="agent_ppo_trainer", 
    version_base=None
)
def main(config):
    train_dataset = DatasetRegistry.load_dataset("countdown", "train")
    test_dataset = DatasetRegistry.load_dataset("countdown", "test")

    trainer = AgentTrainer(
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        agent_run_func=run_workflow,
    )
    trainer.train()

if __name__ == "__main__":
    main()
Launch training:
cd rllm
bash examples/sdk/solver_judge/train_decorator.sh
