Overview
The AnyMathsAdapter optimizes prompts for mathematical word problems of varying complexity. It:
- Works with any dataset containing math problems (GSM8K, MATH, AIME, etc.)
- Supports local models via Ollama (zero cost) or cloud APIs
- Enforces structured output with separate reasoning and answer fields
- Provides detailed feedback for incorrect solutions
Key Result: On GSM8K with ollama/gemma3:1b, GEPA improves accuracy from 9% → 38% (+29 pp) with a budget of 500 metric calls.
Installation
pip install gepa
# Install adapter-specific dependencies
pip install -r src/gepa/adapters/anymaths_adapter/requirements.txt
# For Ollama (local models)
# Install from: https://ollama.com
ollama pull qwen3:4b
ollama pull qwen3:8b
Quick Start
import gepa
from gepa.adapters.anymaths_adapter import AnyMathsAdapter
# Load dataset (e.g., GSM8K)
train_data = [
{
"input": "John has 5 apples. He buys 3 more. How many does he have?",
"answer": "8",
"additional_context": {}
},
# ... more examples
]
# Create adapter with Ollama (FREE)
adapter = AnyMathsAdapter(
model="ollama/qwen3:4b",
api_base="http://localhost:11434",
max_litellm_workers=4
)
# Optimize
result = gepa.optimize(
seed_candidate={
"system_prompt": """You are an AI assistant that solves mathematical word problems.
Provide step-by-step solution and final numerical answer."""
},
trainset=train_data[:50],
valset=train_data[50:100],
adapter=adapter,
max_metric_calls=500,
reflection_lm="ollama/qwen3:8b" # Larger model for reflection
)
print("Optimized prompt:")
print(result.best_candidate["system_prompt"])
Class Signature
Defined in src/gepa/adapters/anymaths_adapter/anymaths_adapter.py:31:
class AnyMathsAdapter(GEPAAdapter[AnyMathsDataInst, AnyMathsTrajectory, AnyMathsRolloutOutput]):
def __init__(
self,
model: str,
failure_score: float = 0.0,
api_base: str | None = "http://localhost:11434",
max_litellm_workers: int = 10,
)
Parameters

model (str, required)
Model for task execution. Supports:
- Ollama: "ollama/qwen3:4b", "ollama/gemma3:1b"
- OpenAI: "openai/gpt-4o-mini"
- Google: "vertex_ai/gemini-2.5-flash-lite"
- Any LiteLLM-supported provider

failure_score (float, default: 0.0)
Score assigned when the answer is incorrect or parsing fails.

api_base (str | None, default: "http://localhost:11434")
API base URL. Required for Ollama; set to None for cloud providers.

max_litellm_workers (int, default: 10)
Maximum number of parallel workers for batch completion.
Data Types
AnyMathsDataInst
Input data structure (src/gepa/adapters/anymaths_adapter/anymaths_adapter.py:9):
class AnyMathsDataInst(TypedDict):
input: str # Math problem statement
additional_context: dict[str, str] # Extra hints/context
answer: str # Expected numerical answer (string)
AnyMathsStructuredOutput
Enforced output schema (src/gepa/adapters/anymaths_adapter/anymaths_adapter.py:24):
class AnyMathsStructuredOutput(BaseModel):
final_answer: str # Numerical answer only (no units/text)
solution_pad: str # Step-by-step solution
AnyMathsTrajectory
Execution trace (src/gepa/adapters/anymaths_adapter/anymaths_adapter.py:15):
class AnyMathsTrajectory(TypedDict):
data: AnyMathsDataInst # Original problem
full_assistant_response: str # Formatted response with reasoning
Structured Output
The adapter enforces structured JSON output:
{
"final_answer": "42",
"solution_pad": "Step 1: Calculate 20 + 22\nStep 2: Result is 42"
}
Key constraints:
- final_answer: must contain only the numerical answer (no units, no text)
- solution_pad: contains the step-by-step reasoning
- The model must follow this format strictly (enforced via JSON schema)
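The adapter enforces this schema at generation time; as a stdlib-only sketch of the same validation, the hypothetical helper below (not part of the adapter, which uses a Pydantic model) parses a raw response and checks both fields:

```python
import json

def parse_structured_output(raw: str) -> tuple[str, str]:
    """Hypothetical helper: validate a raw model response against the schema."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    final_answer = data["final_answer"]  # raises KeyError if a field is missing
    solution_pad = data["solution_pad"]
    if not isinstance(final_answer, str) or not isinstance(solution_pad, str):
        raise ValueError("schema violation: both fields must be strings")
    return final_answer, solution_pad

raw = '{"final_answer": "42", "solution_pad": "Step 1: Calculate 20 + 22"}'
answer, steps = parse_structured_output(raw)
print(answer)  # 42
```

Any exception here corresponds to the parsing-failure case that receives failure_score.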
Methods
evaluate()
Evaluates candidate on batch of math problems.
def evaluate(
self,
batch: list[AnyMathsDataInst],
candidate: dict[str, str],
capture_traces: bool = False,
) -> EvaluationBatch[AnyMathsTrajectory, AnyMathsRolloutOutput]
Implementation: src/gepa/adapters/anymaths_adapter/anymaths_adapter.py:60
Behavior
- Extracts the system prompt from the candidate (first value)
- For each problem:
  - Sends system prompt + problem to the model
  - Enforces the AnyMathsStructuredOutput JSON schema
  - Parses the response to extract final_answer and solution_pad
  - Checks whether data["answer"] is contained in final_answer
- Returns scores (1.0 for correct, 0.0 for incorrect)
- Captures trajectories if capture_traces=True
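The per-example scoring described above boils down to a substring check; score_example is a hypothetical helper mirroring that behavior, not the adapter's actual code:

```python
def score_example(expected: str, final_answer: str, failure_score: float = 0.0) -> float:
    """Hypothetical helper mirroring the adapter's substring-containment check."""
    return 1.0 if expected in final_answer else failure_score

print(score_example("8", "8"))     # 1.0
print(score_example("8", "nine"))  # 0.0
```

Note that substring matching means an expected "8" would also match a final_answer of "18", which is one reason dataset answers must be clean numerical strings.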
make_reflective_dataset()
Generates reflective dataset with detailed feedback.
def make_reflective_dataset(
self,
candidate: dict[str, str],
eval_batch: EvaluationBatch[AnyMathsTrajectory, AnyMathsRolloutOutput],
components_to_update: list[str],
) -> dict[str, list[dict[str, Any]]]
Implementation: src/gepa/adapters/anymaths_adapter/anymaths_adapter.py:130
Returns
{
"system_prompt": [
{
"Inputs": "John has 5 apples. He buys 3 more. How many?",
"Generated Outputs": "Assistant's Solution: 5 + 3 = 8\nFinal Answer: 8",
"Feedback": "The generated response is correct. The final answer is: 8"
},
# ... more examples
]
}
Dataset Preparation
Datasets should follow this schema:
{
"question": "...", # or "input"
"solution": "...", # Optional step-by-step solution
"answer": "42" # Numerical answer only
}
From Hugging Face
from datasets import load_dataset
# Load GSM8K
dataset = load_dataset("openai/gsm8k", "main")
# Convert to AnyMaths format
def convert_example(example):
# Extract numerical answer from "#### 42" format
answer = example["answer"].split("####")[-1].strip()
return {
"input": example["question"],
"answer": answer,
"additional_context": {}
}
train_data = [convert_example(ex) for ex in dataset["train"]]
val_data = [convert_example(ex) for ex in dataset["test"].select(range(100))]
Custom Dataset
For custom datasets, ensure answers are numerical strings:
train_data = [
{
"input": "What is 2 + 2?",
"answer": "4", # Not "4 apples" or "The answer is 4"
"additional_context": {
"difficulty": "easy",
"category": "arithmetic"
}
},
# ... more examples
]
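A quick sanity check before optimization can catch answers that are not plain numeric strings; validate_answers below is a hypothetical helper, not part of the adapter:

```python
def validate_answers(data: list[dict]) -> list[tuple[int, str]]:
    """Return (index, answer) pairs whose answer is not a plain numeric string."""
    bad = []
    for i, ex in enumerate(data):
        ans = ex.get("answer", "")
        try:
            float(str(ans).replace(",", ""))  # accepts "42", "3.5", "1,234"
        except ValueError:
            bad.append((i, ans))
    return bad

examples = [
    {"input": "What is 2 + 2?", "answer": "4"},
    {"input": "Count the apples", "answer": "4 apples"},  # will be flagged
]
print(validate_answers(examples))  # [(1, '4 apples')]
```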
Complete Example
import gepa
from gepa.adapters.anymaths_adapter import AnyMathsAdapter
from datasets import load_dataset
# 1. Load GSM8K dataset
dataset = load_dataset("openai/gsm8k", "main")
def convert_example(example):
answer = example["answer"].split("####")[-1].strip()
return {
"input": example["question"],
"answer": answer,
"additional_context": {}
}
train_data = [convert_example(ex) for ex in dataset["train"].select(range(50))]
val_data = [convert_example(ex) for ex in dataset["train"].select(range(50, 100))]
test_data = [convert_example(ex) for ex in dataset["test"].select(range(50))]
# 2. Create seed prompt
seed_prompt = """
You are an AI assistant that solves mathematical word problems.
Provide:
1. Step-by-step solution in the solution_pad field
2. Final numerical answer in the final_answer field (no units, no text)
Be precise and show your work clearly.
"""
# 3. Create adapter (Ollama - FREE)
adapter = AnyMathsAdapter(
model="ollama/qwen3:4b",
api_base="http://localhost:11434",
max_litellm_workers=4,
failure_score=0.0
)
# 4. Optimize
result = gepa.optimize(
seed_candidate={"system_prompt": seed_prompt},
trainset=train_data,
valset=val_data,
adapter=adapter,
max_metric_calls=500,
reflection_lm="ollama/qwen3:8b"
)
# 5. Evaluate on test set
test_adapter = AnyMathsAdapter(
model="ollama/qwen3:4b",
api_base="http://localhost:11434"
)
test_result = test_adapter.evaluate(
batch=test_data,
candidate=result.best_candidate,
capture_traces=False
)
test_accuracy = sum(test_result.scores) / len(test_result.scores)
print(f"Test Accuracy: {test_accuracy:.1%}")
print(f"\nOptimized Prompt:")
print(result.best_candidate["system_prompt"])
Experimental Results
From the adapter README:
| Dataset | Base LM | Reflection LM | Accuracy Before | Accuracy After | Gain | Budget |
|---|---|---|---|---|---|---|
| GSM8K | ollama/qwen3:4b | ollama/qwen3:8b | 18% | 23% | +5 pp | 500 |
| GSM8K | vertex_ai/gemini-2.5-flash-lite | vertex_ai/gemini-2.5-flash | 31% | 33% | +2 pp | 500 |
| GSM8K | ollama/qwen3:0.6b | ollama/qwen3:8b | 7% | 5% | -2 pp | 500 |
| GSM8K | ollama/gemma3:1b | ollama/gemma3:4b | 9% | 38% | +29 pp | 500 |
Best result: +29 percentage points improvement on GSM8K with ollama/gemma3:1b.
Model-Specific Tips
Small Models (< 1B parameters)
Smaller models often struggle with structured output:
# Use very explicit seed prompt
seed_prompt = """
You MUST respond with valid JSON in this exact format:
{
"final_answer": "<number>",
"solution_pad": "<step by step>"
}
The final_answer field must contain ONLY the numerical answer.
No units, no text, no explanations.
"""
adapter = AnyMathsAdapter(
model="ollama/qwen3:0.6b",
api_base="http://localhost:11434"
)
Medium Models (1-4B parameters)
Good balance of cost and performance:
# Works well with moderate guidance
adapter = AnyMathsAdapter(
model="ollama/qwen3:4b", # Sweet spot
api_base="http://localhost:11434",
max_litellm_workers=4
)
# Use larger model for reflection
result = gepa.optimize(
...,
reflection_lm="ollama/qwen3:8b" # 2x larger for better reflection
)
Cloud Models
For production use:
# Google Vertex AI
adapter = AnyMathsAdapter(
model="vertex_ai/gemini-2.5-flash-lite",
api_base=None, # Uses default Vertex AI endpoint
max_litellm_workers=10
)
result = gepa.optimize(
...,
reflection_lm="vertex_ai/gemini-2.5-flash" # Stronger for reflection
)
# OpenAI
adapter = AnyMathsAdapter(
model="openai/gpt-4o-mini",
api_base=None,
max_litellm_workers=10
)
Prompt Evolution Patterns
GEPA discovers interesting patterns in optimal prompts:
For Small Models
- Goal-oriented: Clearly states the task objective
- Chain-of-Thought: Breaks down problem-solving into numbered steps
- Instruction Detail: Specific guidance on parsing problems and applying formulas
- Few-shot Learning: Concrete examples of different problem types
- Knowledge Base: Mini-rulebook with common pitfalls and edge cases
- Structured Output: Strict output format specification
For Provider Models
- Concise: Fewer tokens, more direct instructions
- Straightforward: Main instruction and output format at the top
- Structured Guidelines: Detailed guidelines follow main instruction
See full prompt examples in the adapter README.
Cost Comparison
Ollama (Local - FREE)
# Total cost: $0.00
adapter = AnyMathsAdapter(
model="ollama/qwen3:4b",
api_base="http://localhost:11434"
)
result = gepa.optimize(
...,
reflection_lm="ollama/qwen3:8b",
max_metric_calls=500
)
Requirements:
- Install Ollama locally
- Download models (3-4GB each)
- ~8GB RAM minimum
OpenAI API
# Approximate cost: $5-10 for 500 calls
adapter = AnyMathsAdapter(
model="openai/gpt-4o-mini", # $0.15/1M input tokens
api_base=None
)
result = gepa.optimize(
...,
reflection_lm="openai/gpt-4", # Proposal only (~10-20 calls)
max_metric_calls=500
)
Google Vertex AI
# Approximate cost: $2-5 for 500 calls
adapter = AnyMathsAdapter(
model="vertex_ai/gemini-2.5-flash-lite",
api_base=None
)
result = gepa.optimize(
...,
reflection_lm="vertex_ai/gemini-2.5-flash",
max_metric_calls=500
)
Best Practices
- Start Small: Test with 10-20 examples before full optimization
- Answer Format: Ensure answers are purely numerical strings
- Budget: Use 500 calls for small models, 200-300 for larger models
- Reflection LM: Use a model 2-4x larger than task model
- Local First: Develop with Ollama, deploy with cloud APIs
- Test Set: Always evaluate on held-out test set
Troubleshooting
JSON Parsing Errors
# Model not following structured output
# Solution: Make seed prompt more explicit
seed_prompt = """
You MUST respond with JSON in this EXACT format:
{
"final_answer": "42",
"solution_pad": "Step 1: ..."
}
Do NOT include any text outside the JSON.
The final_answer must be ONLY a number.
"""
Low Accuracy
# Check if answer matching is too strict
# Try substring matching vs exact matching
# Adapter already uses substring matching:
# score = 1.0 if data["answer"] in assistant_response["final_answer"] else 0.0
# If still failing, check answer format in dataset
print(train_data[0]["answer"]) # Should be "42", not "42 apples"
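If dataset answers contain units or surrounding text, normalizing them during conversion helps; normalize_answer is a hypothetical helper (not part of the adapter) that extracts the first numeric token:

```python
import re

def normalize_answer(raw: str) -> str:
    """Extract the first numeric token, dropping commas and surrounding text."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", raw)
    return match.group(0).replace(",", "") if match else raw.strip()

print(normalize_answer("42 apples"))            # 42
print(normalize_answer("The answer is 1,234"))  # 1234
```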
Ollama Connection Error
# Check Ollama is running
curl http://localhost:11434/api/tags
# Start Ollama if needed
ollama serve
# Verify model is downloaded
ollama list
Advanced Usage
Mixed Provider Strategy
# Use cheap model for task, expensive for reflection
adapter = AnyMathsAdapter(
model="ollama/qwen3:4b", # Free local
api_base="http://localhost:11434"
)
result = gepa.optimize(
...,
reflection_lm="openai/gpt-4" # Pay only for proposals
)
Domain-Specific Context
train_data = [
{
"input": "A train travels at 60 mph for 2 hours...",
"answer": "120",
"additional_context": {
"domain": "physics",
"concept": "speed-distance-time",
"difficulty": "medium"
}
},
# ... more examples
]
# Additional context is used in feedback but not shown to model
See Also