This example demonstrates building a tool-based environment where models search Wikipedia to answer trivia questions. It showcases custom tool implementation, vector search with embeddings, and LLM-as-judge evaluation.

Overview

The Wiki Search environment provides:
  • Tools: Three custom Wikipedia search tools
  • Search: Semantic search using embeddings (ChromaDB + OpenAI)
  • Dataset: Wikipedia trivia questions
  • Evaluation: LLM judge for flexible answer matching
  • Format: Multi-turn tool use

Complete Implementation

Here’s the full working implementation from environments/wiki_search/wiki_search.py:
import asyncio
import os
from typing import cast

import chromadb
from chromadb.api.types import Embeddable, EmbeddingFunction
from chromadb.utils import embedding_functions
from datasets import load_dataset
from openai import AsyncOpenAI

import verifiers as vf
from verifiers.rubrics.judge_rubric import JudgeRubric

CHROMA_DB_DIR = ".chroma_db"
_chroma_semaphore: asyncio.Semaphore | None = None

def _get_chroma_semaphore() -> asyncio.Semaphore:
    global _chroma_semaphore
    if _chroma_semaphore is None:
        _chroma_semaphore = asyncio.Semaphore(100)
    return _chroma_semaphore

def load_environment(
    max_turns: int = 10,
    judge_model: str = "gpt-4.1-mini",
    judge_base_url: str = "https://api.openai.com/v1",
    judge_api_key_var: str = "OPENAI_API_KEY",
    embed_model: str = "text-embedding-3-small",
    embed_base_url: str = "https://api.openai.com/v1",
    embed_api_key_var: str = "OPENAI_API_KEY",
    corpus_dataset: str = "willcb/rare-wiki-pages",
    corpus_split: str = "train",
    chroma_db_dir: str = CHROMA_DB_DIR,
) -> vf.Environment:
    # Load corpus into memory
    corpus = load_dataset(corpus_dataset, split=corpus_split)
    page_id_to_title: dict[str, str] = {}
    page_id_to_content: dict[str, str] = {}
    for row in corpus:
        row = cast(dict, row)
        pid = row["id"]
        page_id_to_title[pid] = row["title"]
        page_id_to_content[pid] = row["content"]

How It Works

1. Custom Tool Implementation

Three tools provide Wikipedia access:

search_pages(query) - Semantic search for relevant articles
  • Uses embedding similarity (text-embedding-3-small)
  • Returns the top 10 matching pages with IDs and titles

view_sections(page_id) - List the sections of an article
  • Parses markdown headers to extract sections
  • Returns section IDs and names

read_section(section_id) - Read a section's content
  • Extracts text between markdown headers
  • Returns the full content of the section
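As a rough sketch of the header parsing behind view_sections, the following splits page content on top-level markdown headers. The function name and section-ID scheme here are illustrative, inferred from the example interaction, not the actual implementation:

```python
def parse_sections(page_id: str, content: str) -> list[dict]:
    """Split markdown page content on top-level '#' headers.

    Illustrative sketch: section IDs follow the "page_id:name" pattern
    seen in the example interaction.
    """
    sections = []
    for line in content.splitlines():
        if line.startswith("# "):
            name = line[2:].strip()
            section_id = f"{page_id}:{name.lower().replace(' ', '_')}"
            sections.append({"section_id": section_id, "section_name": name})
    return sections
```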

2. Vector Search with ChromaDB

Embedding-based search enables semantic matching:
# Initialize ChromaDB with OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small",
    api_base="https://api.openai.com/v1",
    api_key=os.getenv("OPENAI_API_KEY"),
)

client = chromadb.PersistentClient(path=".chroma_db")
collection = client.get_or_create_collection(
    name="wiki_titles",
    embedding_function=openai_ef,
)
Features:
  • Persistent storage (.chroma_db directory)
  • Lazy initialization (loads on first search)
  • Automatic upsert of missing pages
  • Concurrent access via semaphore (max 100 parallel)
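The upsert behavior could look roughly like the helper below. `ensure_pages_indexed` is a hypothetical name, not part of the actual module; `collection` is assumed to be a ChromaDB collection exposing `get(ids=...)` and `upsert(ids=..., documents=...)`:

```python
def ensure_pages_indexed(collection, page_id_to_title: dict[str, str]) -> int:
    """Upsert any titles not yet in the collection; returns the number added.

    Hypothetical sketch of the "automatic upsert of missing pages" behavior.
    """
    # Ask ChromaDB which of our page IDs it already has
    existing = set(collection.get(ids=list(page_id_to_title))["ids"])
    missing = [pid for pid in page_id_to_title if pid not in existing]
    if missing:
        collection.upsert(
            ids=missing,
            documents=[page_id_to_title[pid] for pid in missing],
        )
    return len(missing)
```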

3. LLM Judge Evaluation

JudgeRubric uses an LLM to evaluate answers:
judge_rubric = JudgeRubric(
    judge_client=judge_client,
    judge_model="gpt-4.1-mini",
    judge_prompt=JUDGE_PROMPT,
)

async def judge_reward_func(judge, prompt, completion, answer, state) -> float:
    judge_response = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in judge_response.lower() else 0.0
Why judge evaluation?
  • Trivia answers have multiple valid phrasings
  • Example: “Barack Obama” = “Obama” = “Barack H. Obama”
  • Judge can handle paraphrasing and context
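To see why plain string matching falls short here, consider a naive substring check (illustrative only, not part of the environment):

```python
def contains_answer(response: str, answer: str) -> bool:
    """Naive matching: require the expected answer verbatim in the response."""
    return answer.lower() in response.lower()

# A correct but differently-phrased response is rejected:
contains_answer("The answer is Obama.", "Barack Obama")  # False
```

An LLM judge recognizes "Obama" and "Barack Obama" as the same answer, so valid paraphrases are not penalized.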

4. Tool Environment

ToolEnv provides tool calling capabilities:
vf_env = vf.ToolEnv(
    dataset=dataset,
    tools=[search_pages, view_sections, read_section],
    max_turns=10,
    rubric=judge_rubric,
)
The environment:
  • Extracts tool schemas from function signatures
  • Executes tool calls and returns results
  • Tracks tool usage metrics

Example Interaction

Question: Who painted the Mona Lisa?
Assistant: I’ll search for information about the Mona Lisa painting.

Tool Call: search_pages("Mona Lisa painting")

Tool Result:
[
  {"page_id": "mona_lisa", "title": "Mona Lisa"},
  {"page_id": "leonardo_da_vinci", "title": "Leonardo da Vinci"},
  {"page_id": "louvre", "title": "Louvre Museum"},
  ...
]

Assistant: Let me view the sections of the Mona Lisa article.

Tool Call: view_sections("mona_lisa")

Tool Result:
[
  {"section_id": "mona_lisa:description", "section_name": "Description"},
  {"section_id": "mona_lisa:history", "section_name": "History"},
  {"section_id": "mona_lisa:artist", "section_name": "Artist"},
  ...
]

Assistant: I’ll read the artist section.

Tool Call: read_section("mona_lisa:artist")

Tool Result:
# Artist

The Mona Lisa was painted by Leonardo da Vinci in the early 16th century, 
between 1503 and 1519. It is considered one of his most famous works...

Assistant: Based on the Wikipedia article, Leonardo da Vinci painted the Mona Lisa.

Judge Evaluation: “yes”

Result: ✓ Correct (reward = 1.0)

Running the Environment

Installation

# Install with ChromaDB dependency
prime env install wiki-search

Quick Evaluation

# Evaluate on 20 questions
prime eval run wiki-search \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 20 \
  -r 5

Custom Configuration

# Use different embedding and judge models
prime eval run wiki-search \
  -m openai/gpt-4.1-mini \
  -a '{
    "max_turns": 15,
    "judge_model": "openai/gpt-4o-mini",
    "embed_model": "text-embedding-3-large",
    "corpus_dataset": "willcb/rare-wiki-pages"
  }' \
  -n 50 \
  -r 4

Configuration Options

  • max_turns (default: 10): Maximum tool interactions
  • judge_model (default: "gpt-4.1-mini"): Model for evaluating answers
  • judge_base_url (default: OpenAI API): Judge model endpoint
  • judge_api_key_var (default: "OPENAI_API_KEY"): Env var for judge API key
  • embed_model (default: "text-embedding-3-small"): Embedding model for search
  • embed_base_url (default: OpenAI API): Embedding endpoint
  • embed_api_key_var (default: "OPENAI_API_KEY"): Env var for embedding API key
  • corpus_dataset (default: "willcb/rare-wiki-pages"): Wikipedia corpus
  • chroma_db_dir (default: ".chroma_db"): ChromaDB storage path

Key Features

Tool Schema Generation

Verifiers automatically extracts tool schemas from Python functions:
async def search_pages(query: str) -> list[dict]:
    """Search for top 10 relevant articles using title embedding similarity.

    args:
        query (str): The query to search for.

    returns:
        list[dict]: A list of dicts with page_id and title.
    """
Generated schema:
{
  "name": "search_pages",
  "description": "Search for top 10 relevant articles using title embedding similarity.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The query to search for."
      }
    },
    "required": ["query"]
  }
}
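A minimal sketch of how such extraction can work using the standard inspect module. The real verifiers implementation is more thorough (e.g. it also parses per-argument docstring descriptions); this version only maps annotations to JSON types:

```python
import inspect

# Map Python annotations to JSON-schema type names (illustrative subset)
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def build_schema(fn) -> dict:
    """Sketch of signature-to-schema extraction; not the verifiers internals."""
    sig = inspect.signature(fn)
    props, required = {}, []
    for name, param in sig.parameters.items():
        props[name] = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value => required parameter
    doc = (inspect.getdoc(fn) or "").splitlines()
    return {
        "name": fn.__name__,
        "description": doc[0] if doc else "",
        "parameters": {"type": "object", "properties": props, "required": required},
    }
```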

Concurrent Search with Semaphores

Manage concurrent ChromaDB access:
_chroma_semaphore = asyncio.Semaphore(100)  # Max 100 concurrent queries

async with _chroma_semaphore:
    results = await asyncio.to_thread(
        collection.query, query_texts=[query], n_results=10
    )
Prevents overwhelming the embedding API or ChromaDB.
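The pattern generalizes to a small self-contained helper (names here are illustrative, not from the module):

```python
import asyncio

_sem = asyncio.Semaphore(100)  # cap concurrent blocking queries

async def bounded_query(blocking_fn, *args, **kwargs):
    """Run a blocking call in a worker thread, at most 100 at a time."""
    async with _sem:
        return await asyncio.to_thread(blocking_fn, *args, **kwargs)
```

Because ChromaDB's query API is synchronous, asyncio.to_thread keeps the event loop responsive while the semaphore bounds total concurrency.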

Persistent Vector Storage

ChromaDB persists embeddings to disk:
  • First run: embeds all Wikipedia titles (slow)
  • Subsequent runs: loads from .chroma_db/ (fast)
  • Automatic incremental updates for new pages

Metrics Tracked

  • judge_reward: 1.0 if judge says “yes”, 0.0 otherwise
  • num_turns: Number of tool interactions
  • total_tool_calls: Total tools called
  • search_pages_calls: Number of search operations
  • view_sections_calls: Number of section list operations
  • read_section_calls: Number of section reads

Advanced Usage

Custom Wikipedia Corpus

Use your own Wikipedia subset:
env = load_environment(
    corpus_dataset="your-username/custom-wiki-corpus",
    corpus_split="train",
)
Required format:
{
    "id": "page_id",
    "title": "Page Title",
    "content": "# Section 1\n\nContent...\n\n# Section 2\n\n..."
}
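A quick way to sanity-check a custom corpus against this format before loading it (hypothetical helper, not part of the environment):

```python
REQUIRED_KEYS = ("id", "title", "content")

def validate_row(row: dict) -> bool:
    """Check that a corpus row has the three required string fields."""
    return all(isinstance(row.get(key), str) for key in REQUIRED_KEYS)
```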

Different Embedding Models

Use alternative embedding models:
# OpenAI alternatives
env = load_environment(
    embed_model="text-embedding-3-large",  # Higher quality
)

# Or use sentence-transformers (requires custom setup)
# See ChromaDB docs for details

Custom Judge Prompts

Modify evaluation criteria:
CUSTOM_JUDGE_PROMPT = """Does the response contain the correct answer?

Question: {question}
Expected: {answer}
Response: {response}

Answer 'yes' only if the response contains factually correct information 
matching the expected answer. Ignore formatting differences.
"""

judge_rubric = JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt=CUSTOM_JUDGE_PROMPT,
)

Adding Tool Call Efficiency Rewards

Reward efficient tool usage:
async def efficiency_reward(state) -> float:
    """Reward finding answer with fewer tool calls."""
    tool_calls = state.get("total_tool_calls", 0)
    if tool_calls <= 3:
        return 0.5  # Bonus for quick answers
    return 0.0

judge_rubric.add_reward_func(efficiency_reward, weight=1.0)
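A graded variant decays the bonus linearly instead of paying a flat amount. This is a sketch; the 10-call budget is an arbitrary choice, not a value from the environment:

```python
async def scaled_efficiency_reward(state) -> float:
    """Bonus shrinks linearly from 0.5 (zero tool calls) to 0.0 at 10 calls."""
    tool_calls = state.get("total_tool_calls", 0)
    return max(0.0, 0.5 * (1 - tool_calls / 10))
```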
