Overview
The Wiki Search environment provides:

- Tools: Three custom Wikipedia search tools
- Search: Semantic search using embeddings (ChromaDB + OpenAI)
- Dataset: Wikipedia trivia questions
- Evaluation: LLM judge for flexible answer matching
- Format: Multi-turn tool use
Complete Implementation
Here’s the full working implementation from `environments/wiki_search/wiki_search.py`, organized into the following parts:
- Main (Part 1)
- ChromaDB Setup
- Tools (Part 1)
- Tools (Part 2)
- Tools (Part 3)
- Judge & Environment
How It Works
1. Custom Tool Implementation
Three tools provide Wikipedia access:

search_pages(query) - Semantic search for relevant articles
- Uses embedding similarity (text-embedding-3-small)
- Returns top 10 matching pages with IDs and titles
view_sections(page_id) - List sections of an article
- Parses markdown headers to extract sections
- Returns section IDs and names
read_section(section_id) - Read section content
- Extracts text between markdown headers
- Returns full content of the section
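The section tools reduce to markdown parsing. Here is a simplified, self-contained sketch of `view_sections` and `read_section`, assuming pages are stored as markdown strings keyed by page ID (the real implementation in `wiki_search.py` differs in storage and ID details):

```python
import re

# Toy in-memory corpus; the real environment loads pages from the dataset.
PAGES = {
    "mona_lisa": (
        "# Mona Lisa\n"
        "Intro text.\n"
        "## Artist\n"
        "Painted by Leonardo da Vinci.\n"
        "## History\n"
        "In the Louvre since 1797.\n"
    )
}

def view_sections(page_id: str) -> list:
    """List sections of a page by parsing markdown headers."""
    sections = []
    for match in re.finditer(r"^(#{1,6})\s*(.+)$", PAGES[page_id], re.M):
        name = match.group(2).strip()
        section_id = f"{page_id}:{name.lower().replace(' ', '_')}"
        sections.append({"id": section_id, "name": name})
    return sections

def read_section(section_id: str) -> str:
    """Return the text between a section's header and the next header."""
    page_id, name = section_id.split(":", 1)
    text = PAGES[page_id]
    header = re.escape(name.replace("_", " "))
    pattern = rf"^#{{1,6}}\s*{header}\s*$(.*?)(?=^#{{1,6}}\s|\Z)"
    match = re.search(pattern, text, re.M | re.S | re.I)
    return match.group(1).strip() if match else ""
```

Keeping section IDs as `page_id:section_name` lets the model chain `view_sections` output directly into `read_section` calls.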
2. Vector Search with ChromaDB
Embedding-based search enables semantic matching:

- Persistent storage (.chroma_db directory)
- Lazy initialization (loads on first search)
- Automatic upsert of missing pages
- Concurrent access via semaphore (max 100 parallel)
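The lazy-initialization and semaphore pattern can be shown in isolation. This sketch stubs out the actual ChromaDB/OpenAI calls (marked in comments) and keeps only the concurrency structure:

```python
import asyncio

_collection = None                   # built on first search, not at import time
_semaphore = asyncio.Semaphore(100)  # cap concurrent queries at 100

async def _get_collection():
    """Lazily build the search index on first use."""
    global _collection
    if _collection is None:
        # Real code: open a persistent ChromaDB collection and upsert
        # embeddings for any pages not yet indexed.
        _collection = {"mona lisa": "mona_lisa"}
    return _collection

async def search_pages(query: str) -> list:
    async with _semaphore:           # at most 100 searches in flight
        collection = await _get_collection()
        # Real code: embed `query` and run a vector-similarity search.
        return [pid for title, pid in collection.items()
                if query.lower() in title]

results = asyncio.run(search_pages("Mona Lisa"))
```

The semaphore matters because many rollouts run concurrently during training, and each may hit the embedding endpoint and ChromaDB at once.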
3. LLM Judge Evaluation
JudgeRubric uses an LLM to evaluate answers:
- Trivia answers have multiple valid phrasings
- Example: “Barack Obama” = “Obama” = “Barack H. Obama”
- Judge can handle paraphrasing and context
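The judge's free-form verdict has to be mapped to a scalar reward. A minimal sketch of that mapping (the exact prompt wording and parsing inside JudgeRubric may differ):

```python
# Illustrative judge prompt template; the real JudgeRubric prompt may differ.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {answer}\n"
    "Model answer: {response}\n"
    "Is the model answer correct? Reply 'yes' or 'no'."
)

def judge_reward(judge_response: str) -> float:
    """1.0 if the judge says yes, else 0.0 (tolerant of casing/punctuation)."""
    return 1.0 if judge_response.strip().lower().startswith("yes") else 0.0
```

This is why "Obama" can score as highly as "Barack H. Obama": the judge model, not string matching, decides equivalence, and only its yes/no verdict is parsed.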
4. Tool Environment
ToolEnv provides tool calling capabilities:
- Extracts tool schemas from function signatures
- Executes tool calls and returns results
- Tracks tool usage metrics
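The idea behind automatic schema extraction can be illustrated with a small `inspect`-based sketch; verifiers' own generator is more complete (docstring parsing, richer type handling), so treat this as the shape of the mechanism rather than its actual code:

```python
import inspect

# Minimal Python-to-JSON-Schema type mapping for the sketch.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn) -> dict:
    """Build a minimal OpenAI-style tool schema from a function signature."""
    props = {}
    for name, param in inspect.signature(fn).parameters.items():
        props[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": (fn.__doc__ or "").strip(),
            "parameters": {
                "type": "object",
                "properties": props,
                "required": list(props),
            },
        },
    }

def search_pages(query: str) -> str:
    """Semantic search for relevant Wikipedia articles."""
    ...

schema = tool_schema(search_pages)
```

Because the schema is derived from the signature and docstring, adding a tool to the environment is just writing a well-annotated Python function.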
Example Interaction
Question: Who painted the Mona Lisa?

Assistant: I’ll search for information about the Mona Lisa painting.
Tool Call: search_pages("Mona Lisa painting")

Assistant: Let me view the sections of the Mona Lisa article.
Tool Call: view_sections("mona_lisa")

Assistant: I’ll read the artist section.
Tool Call: read_section("mona_lisa:artist")

Assistant: Based on the Wikipedia article, Leonardo da Vinci painted the Mona Lisa.

Judge Evaluation: “yes”
Result: ✓ Correct (reward = 1.0)
Running the Environment
Installation
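Assuming the standard verifiers CLI workflow (command names may differ across versions):

```shell
# Install the environment module and its dependencies
# (assumes the verifiers vf-install CLI).
vf-install wiki-search

# The judge and embedding calls need an OpenAI-compatible key.
export OPENAI_API_KEY=sk-...
```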
Quick Evaluation
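A quick smoke test can be run from the CLI; the flags shown here (`-m` for model, `-n` for number of examples) are assumptions based on the usual verifiers `vf-eval` interface:

```shell
# Evaluate a handful of trivia questions with a small model.
vf-eval wiki-search -m gpt-4.1-mini -n 5
```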
Custom Configuration
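The parameters in the table below can be overridden at load time. A configuration sketch, assuming the standard `vf.load_environment` entry point:

```python
import verifiers as vf

# All keyword arguments correspond to the configuration options below.
env = vf.load_environment(
    "wiki-search",
    max_turns=6,                               # fewer tool interactions
    judge_model="gpt-4.1-mini",
    embed_model="text-embedding-3-small",
    corpus_dataset="willcb/rare-wiki-pages",
    chroma_db_dir=".chroma_db",
)
```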
Configuration Options
| Parameter | Default | Description |
|---|---|---|
| `max_turns` | `10` | Maximum tool interactions |
| `judge_model` | `"gpt-4.1-mini"` | Model for evaluating answers |
| `judge_base_url` | OpenAI API | Judge model endpoint |
| `judge_api_key_var` | `"OPENAI_API_KEY"` | Env var for judge API key |
| `embed_model` | `"text-embedding-3-small"` | Embedding model for search |
| `embed_base_url` | OpenAI API | Embedding endpoint |
| `embed_api_key_var` | `"OPENAI_API_KEY"` | Env var for embedding API key |
| `corpus_dataset` | `"willcb/rare-wiki-pages"` | Wikipedia corpus |
| `chroma_db_dir` | `".chroma_db"` | ChromaDB storage path |
Key Features
Tool Schema Generation
Verifiers automatically extracts tool schemas from Python functions.
Concurrent Search with Semaphores
A semaphore limits concurrent ChromaDB access.
Persistent Vector Storage
ChromaDB persists embeddings to disk:

- First run: embeds all Wikipedia titles (slow)
- Subsequent runs: loads from .chroma_db/ (fast)
- Automatic incremental updates for new pages
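The incremental-update step reduces to diffing corpus page IDs against what the collection already holds. A stub sketch of that logic (the real code fetches existing IDs from ChromaDB and upserts OpenAI embeddings for the missing ones):

```python
def missing_page_ids(collection_ids, corpus_ids) -> list:
    """Return corpus pages that have not been embedded yet, in order."""
    have = set(collection_ids)
    return [pid for pid in corpus_ids if pid not in have]

# On later runs, only new pages get embedded and upserted.
to_embed = missing_page_ids(["mona_lisa"], ["mona_lisa", "louvre"])
```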
Metrics Tracked
- `judge_reward`: 1.0 if judge says “yes”, 0.0 otherwise
- `num_turns`: number of tool interactions
- `total_tool_calls`: total tools called
- `search_pages_calls`: number of search operations
- `view_sections_calls`: number of section list operations
- `read_section_calls`: number of section reads
Advanced Usage
Custom Wikipedia Corpus
Use your own Wikipedia subset via the corpus_dataset parameter.
Different Embedding Models
Use alternative embedding models via embed_model and embed_base_url.
Custom Judge Prompts
Modify the evaluation criteria by customizing the judge prompt.
Adding Tool Call Efficiency Rewards
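An efficiency bonus can be added as an extra reward function alongside the judge. A sketch, assuming the rubric passes the completion as a list of chat messages (the function name and decay schedule are illustrative):

```python
def efficiency_reward(completion, **kwargs) -> float:
    """Full reward for <=3 tool calls, decaying as the count grows."""
    num_calls = sum(
        len(msg.get("tool_calls") or [])
        for msg in completion
        if msg.get("role") == "assistant"
    )
    # 1.0 for 0-3 calls, 3/n beyond that (e.g. 0.5 at 6 calls).
    return min(1.0, 3 / max(num_calls, 3))
```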
Reward efficient tool usage with an additional reward function in the rubric.
Related Examples
- GSM8K - Single-turn without tools
- Math Python - Code execution tools
- Wordle - Multi-turn game environment
- Browser Examples - Complex stateful tools
Next Steps
- Learn about ToolEnv architecture
- See Custom Tools for implementing your own
- Explore JudgeRubric for LLM evaluation
- Check out ChromaDB for vector storage