Math Problem Optimization Tutorial
Learn how to use GEPA to optimize system prompts for mathematical reasoning tasks. In this tutorial, we’ll improve GPT-4.1 Mini’s performance on AIME (American Invitational Mathematics Examination) problems from 46.6% to 56.6% accuracy through prompt optimization alone.
Overview
This tutorial demonstrates:
- Training on AIME 2022-2024 problems
- Testing on AIME 2025 (held-out set)
- Optimizing system prompts for complex mathematical reasoning
- Achieving significant gains without model fine-tuning
Load the Dataset
GEPA provides a built-in AIME dataset loader.
The dataset includes:
- Training: AIME validation problems (AI-MO/aimo-validation-aime)
- Test: AIME 2025 problems (MathArena/aime_2025)
- Each example contains the problem, solution, and answer, with the answer given in the format "### <answer>"
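As a rough illustration of the answer format described above, the following sketch parses a trailing "### <answer>" line from a response. The helper name `extract_answer` and the sample strings are hypothetical and not part of GEPA; the only outside fact relied on is that AIME answers are integers from 0 to 999.

```python
import re
from typing import Optional

# Matches a trailing line such as "### 204"; AIME answers are integers from 0 to 999.
ANSWER_PATTERN = re.compile(r"###\s*(\d{1,3})\s*$")


def extract_answer(text: str) -> Optional[int]:
    """Return the integer that follows a trailing '###' marker, or None if there is none."""
    match = ANSWER_PATTERN.search(text.strip())
    return int(match.group(1)) if match else None


print(extract_answer("The count is 204.\n### 204"))  # 204
print(extract_answer("I am not sure."))              # None
```

A parser like this is deliberately strict: a response that omits the marker counts as unanswered rather than being guessed at.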
Define the Seed Prompt
Start with a basic system prompt. This simple prompt serves as our baseline; GEPA will evolve it into a detailed, strategy-rich prompt.
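A seed candidate of this kind might look like the sketch below. The wording of the prompt and the `system_prompt` key name are assumptions made for illustration, not necessarily the exact text or structure used in this tutorial.

```python
# Illustrative seed candidate. Both the prompt wording and the
# "system_prompt" key name are assumptions for this sketch.
seed_candidate = {
    "system_prompt": (
        "You are a helpful assistant. Solve the given math problem and "
        "give the final answer at the end of your response in the format "
        "'### <answer>'."
    )
}

print(seed_candidate["system_prompt"])
```

Keeping the seed short makes it easier to see later which instructions were added by the optimizer.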
Run GEPA Optimization
Optimize the prompt using GEPA.
Key Parameters:
- task_lm: The model whose prompt is being optimized (GPT-4.1 Mini)
- reflection_lm: The model that proposes improved prompts (GPT-5)
- max_metric_calls: The evaluation budget, i.e. the maximum number of metric calls (150)
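The parameters above could be wired together roughly as follows. This is a sketch under stated assumptions: it assumes the `gepa` package exposes an `optimize` function that accepts these keyword arguments along with `seed_candidate`, `trainset`, and `valset`, and the provider-prefixed model identifiers are guesses at the naming scheme. The wrapper `run_gepa` is a hypothetical helper.

```python
from typing import Any

# Settings named in the parameter list above. The provider-prefixed model
# identifiers are assumptions; use whatever names your setup expects.
OPTIMIZE_KWARGS: dict[str, Any] = {
    "task_lm": "openai/gpt-4.1-mini",  # model whose prompt is optimized
    "reflection_lm": "openai/gpt-5",   # model that proposes new prompts
    "max_metric_calls": 150,           # evaluation budget
}


def run_gepa(seed_candidate: dict[str, str], trainset: list, valset: list) -> Any:
    """Call gepa.optimize with the settings above (needs the gepa package and API keys)."""
    import gepa  # imported lazily so the sketch loads without the package

    # Assumed call shape: the keyword arguments listed in this tutorial,
    # plus seed_candidate/trainset/valset for the prompt and the data.
    return gepa.optimize(
        seed_candidate=seed_candidate,
        trainset=trainset,
        valset=valset,
        **OPTIMIZE_KWARGS,
    )
```

The returned object should expose the best prompt found; consult the GEPA documentation for the exact attribute names in your installed version.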
Understand the Results
Performance Improvement:
- Baseline: 46.6% accuracy with simple prompt
- Optimized: 56.6% accuracy with GEPA-optimized prompt
- Gain: +10 percentage points from prompt optimization alone
The optimized prompt includes:
- Domain-specific strategies for base conversion, palindromes, symmetric sums
- Step-by-step problem-solving guidance
- Common pitfall warnings and verification steps
- Structured output format requirements
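The accuracy figures reported in this section can be read in two ways: as an absolute gain in percentage points, or as a relative gain over the baseline. The short calculation below shows both, using only the 46.6% and 56.6% figures from this tutorial; the variable names are illustrative.

```python
baseline_acc = 46.6   # percent, simple seed prompt
optimized_acc = 56.6  # percent, GEPA-optimized prompt

# Absolute gain in percentage points.
gain_pp = round(optimized_acc - baseline_acc, 1)

# Relative gain with respect to the baseline accuracy.
relative_gain = round(100 * gain_pp / baseline_acc, 1)

print(f"Absolute gain: {gain_pp} percentage points")
print(f"Relative gain: {relative_gain}%")
```

The distinction matters when comparing against other reports: a 10-point absolute gain here corresponds to a relative improvement of roughly one fifth over the baseline.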
Key Takeaways
Significant Gains
10 percentage point improvement on AIME 2025 from prompt optimization alone, without fine-tuning or architectural changes.
Domain Knowledge
GEPA automatically discovers problem-solving strategies, common pitfalls, and verification steps specific to mathematical reasoning.
Generalization
The optimized prompt generalizes to unseen AIME 2025 problems after training on 2022-2024 data.
Efficient Search
Achieves strong results with a budget of just 150 metric calls, far fewer than RL-based methods (5,000-25,000+ evaluations).
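To put the budget comparison above in concrete terms, the sketch below computes how many times larger the quoted RL budgets are. It treats both figures as counts of evaluations, which is an assumption about how the two numbers were measured.

```python
gepa_budget = 150                 # budget used in this tutorial
rl_low, rl_high = 5_000, 25_000   # range quoted for RL-based methods

# How many times larger the RL budgets are than the GEPA budget.
ratio_low = round(rl_low / gepa_budget, 1)
ratio_high = round(rl_high / gepa_budget, 1)

print(f"RL-based methods use roughly {ratio_low}x to {ratio_high}x the budget")
```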
Advanced Usage
Use with DSPy
For more complex AI pipelines, integrate GEPA through DSPy.
Custom Metrics
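A custom metric for this task could return both a numeric score and a short textual explanation that a reflection model can read. The sketch below is hypothetical: `MetricResult` and `exact_answer_metric` are illustrative names, and the signature that GEPA or DSPy actually expects is not shown in this tutorial, so adapt it to your installed version.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    score: float   # 1.0 for a correct final answer, else 0.0
    feedback: str  # text a reflection model could read


def exact_answer_metric(expected: str, response: str) -> MetricResult:
    """Compare the last non-empty line of the response with the expected answer line."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    final_line = lines[-1] if lines else ""
    if final_line == expected.strip():
        return MetricResult(1.0, "Correct final answer.")
    return MetricResult(
        0.0,
        f"Expected '{expected.strip()}' but the response ended with '{final_line}'.",
    )


result = exact_answer_metric("### 204", "Work shown here.\n### 204\n")
print(result.score, result.feedback)
```

Returning feedback alongside the score gives the reflection step something specific to act on, rather than only a pass or fail signal.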
Define custom evaluation metrics for your domain.
Related Resources
GEPA Paper
Read the full research paper on reflective prompt evolution
DSPy Tutorials
Complete AIME tutorial with executable notebooks
Simple Prompt Tutorial
Learn basic prompt optimization concepts
Agent Architecture
Optimize entire agent systems, not just prompts
Next Steps
- Try optimizing prompts for other mathematical benchmarks (MATH, GSM8K)
- Experiment with different reflection models and budgets
- Combine with other optimization techniques for even better results
- Explore the full DSPy Program adapter for optimizing entire reasoning chains