# Quick Start with GEPA
Optimize any text parameter — prompts, code, agent architectures — using LLM-based reflection and Pareto-efficient evolutionary search. This guide walks you through your first optimization in minutes.

## What You’ll Build
In this quick start, you’ll optimize a system prompt for math problems from the AIME benchmark. With just a few lines of code, you’ll see performance jump from 46.6% → 56.6% accuracy on GPT-4.1 Mini.

## Install GEPA
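First, install the package. This assumes GEPA is published on PyPI under the name `gepa`; check the repository README if the name differs:

```shell
# Install GEPA from PyPI (package name assumed)
pip install gepa
```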
Set your API key as an environment variable:

```shell
export OPENAI_API_KEY=your_key_here
```

## Run Your First Optimization
Here’s a complete working example that optimizes a math reasoning prompt.

**Expected result:** GPT-4.1 Mini accuracy improves from 46.6% → 56.6% on AIME 2025 (a +10 percentage point gain).
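A run matching that description might look like the following sketch, modeled on the `gepa` package’s README-style interface. The dataset helper, argument names, model identifiers, and result attributes here are assumptions; consult the API reference for the exact signatures.

```python
import gepa

# Load AIME-style train/validation splits (helper name is an assumption).
trainset, valset, _ = gepa.examples.aime.init_dataset()

seed_prompt = {
    "system_prompt": "You are a helpful assistant. Solve the problem step by step."
}

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",   # the model being optimized
    reflection_lm="openai/gpt-4o",   # a stronger model that analyzes failures
    max_metric_calls=150,            # total evaluation budget
)

print(result.best_candidate["system_prompt"])
```

`max_metric_calls` caps the total number of evaluations across the whole run; raise it for harder tasks.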
## Understand What Happened
GEPA just ran an evolutionary optimization loop:
- **Evaluate** — The seed prompt is tested on training examples
- **Reflect** — An LLM analyzes failures and diagnoses why they occurred
- **Mutate** — New candidate prompts are generated based on the reflection
- **Select** — Better candidates are kept using Pareto-efficient search
- **Repeat** — The loop continues until the evaluation budget (`max_metric_calls`, 150 in this example) is exhausted
Unlike RL methods that need 5,000-25,000+ evaluations, GEPA achieves strong results with just 100-500 evaluations by using full execution traces instead of scalar rewards.
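The five steps above can be sketched as a short loop. This is a toy illustration with placeholder evaluate/mutate hooks, not the real `gepa` internals:

```python
import random

def optimize(seed, train, evaluate, reflect_and_mutate, max_metric_calls=150):
    """Toy sketch of GEPA's loop: evaluate, reflect, mutate, Pareto-select, repeat."""
    # Score the seed candidate on every training example.
    pool = {seed: [evaluate(seed, ex) for ex in train]}
    calls = len(train)
    while calls + len(train) <= max_metric_calls:
        # In real GEPA the mutation comes from LLM reflection on execution traces.
        parent = random.choice(list(pool))
        child = reflect_and_mutate(parent, pool[parent])
        pool[child] = [evaluate(child, ex) for ex in train]
        calls += len(train)
        # Pareto selection: keep a candidate only if it ties the best
        # score on at least one training example.
        best = [max(scores[i] for scores in pool.values()) for i in range(len(train))]
        pool = {c: s for c, s in pool.items()
                if any(s[i] == best[i] for i in range(len(train)))}
    # Return the frontier candidate with the highest total score.
    return max(pool, key=lambda c: sum(pool[c]))
```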
## What You Can Optimize

GEPA isn’t limited to prompts. You can optimize any text parameter against any evaluation metric:

- **Prompts**: System prompts, instructions, few-shot examples
- **Code**: Functions, algorithms, configurations, policies
- **Agent architectures**: Entire agent systems, tool descriptions, workflows
- **Configurations**: JSON configs, YAML files, scheduling policies
## Real-World Results
| Use Case | Result | Source |
|---|---|---|
| Enterprise agents | 90x cheaper than Claude Opus 4.1 | Databricks |
| ARC-AGI agent | 32% → 89% accuracy | Blog |
| Cloud scheduling | 40.2% cost savings | Blog |
| Coding agent | 55% → 82% resolve rate | Blog |
| Math reasoning | 67% → 93% on MATH | DSPy Full Program Adapter |
GEPA is in production at Shopify, Databricks, Dropbox, OpenAI, Pydantic, MLflow, Comet ML, and 50+ organizations.
## Key Concepts
Understanding these concepts will help you use GEPA effectively.

### Pareto-Efficient Search
GEPA maintains a Pareto frontier of candidates. A candidate stays on the frontier if it’s the best at any subset of examples — even if its average score is lower. This prevents the loss of specialized solutions.

### Actionable Side Information (ASI)
Traditional optimizers only see pass/fail scores. GEPA reads full execution traces:

- Error messages and stack traces
- Model reasoning steps
- Profiling data and timings
- Any diagnostic information you log
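A metric that surfaces such side information might look like the sketch below. The dict shape and field names are illustrative; the exact interface depends on the adapter you use:

```python
def metric_with_feedback(example, prediction):
    """Return a score plus textual side information for the reflection LM.

    In real GEPA adapters this feedback travels with the evaluation trace;
    the exact plumbing depends on the adapter.
    """
    correct = prediction.strip() == example["answer"].strip()
    feedback = (
        "Correct." if correct
        else f"Expected {example['answer']!r} but got {prediction!r}; "
             "check the final-answer formatting."
    )
    return {"score": 1.0 if correct else 0.0, "feedback": feedback}
```

Richer feedback (stack traces, reasoning steps, timings) gives the reflection model more to diagnose than a bare 0/1 score.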
### LLM-Based Reflection
Instead of random mutations, GEPA uses an LLM to:

- Read execution traces from failed examples
- Diagnose root causes of failures
- Propose targeted improvements
- Learn from accumulated lessons across all ancestors
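The Pareto selection rule described under Key Concepts can be written in a few lines. This is a toy illustration, not GEPA’s actual implementation:

```python
def pareto_frontier(candidates):
    """Keep every candidate that ties the best score on at least one example.

    `candidates` maps a candidate id to its per-example score list.
    A specialist that excels on one hard example survives even if its
    average is mediocre, which is what preserves specialized solutions.
    """
    n = len(next(iter(candidates.values())))
    best_per_example = [
        max(scores[i] for scores in candidates.values()) for i in range(n)
    ]
    return {
        cand: scores
        for cand, scores in candidates.items()
        if any(scores[i] == best_per_example[i] for i in range(n))
    }
```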
## Configuration Tips
### Choosing Models
- Task model: Your production model or a cost-effective proxy
- Reflection model: GPT-4o, Claude Opus, or o1 for best improvements
- Budget: Start with 50-100 evaluations, increase to 150-300 for complex tasks
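A quick way to sanity-check the budget before a run. The token count and price below are placeholders, not real pricing; substitute your model’s actual averages:

```python
def estimate_cost(max_metric_calls, tokens_per_call=2_000,
                  price_per_1m_tokens=0.60):
    """Back-of-envelope optimization cost in dollars.

    tokens_per_call and price_per_1m_tokens are illustrative placeholders.
    """
    total_tokens = max_metric_calls * tokens_per_call
    return total_tokens * price_per_1m_tokens / 1_000_000
```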
### Data Requirements
#### Training set size
10-50 examples is usually sufficient. GEPA works with as few as 3 examples but more data gives better results.
- Simple tasks: 10-20 examples
- Complex tasks: 30-50 examples
- Ensure diversity: Cover edge cases and failure modes
#### Validation set size
20-30% of your total data should be held out for validation.
- Minimum: 5-10 examples
- Recommended: 10-20 examples
- Must represent real-world usage patterns
#### Data format
Each example needs:
- `input`: The input to your system
- `answer` or `output`: Expected result
- `reasoning`: Expected reasoning steps (for complex tasks)
- `metadata`: Any task-specific context
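For instance, a single training example for a simple arithmetic task might look like this. The field names follow the list above; the exact schema depends on your adapter and metric:

```python
# One training example in the shape described above (schema illustrative).
example = {
    "input": "What is 17 + 25?",
    "answer": "42",
    "reasoning": "17 + 25 = 42.",  # optional, for complex tasks
    "metadata": {"source": "synthetic", "difficulty": "easy"},
}
```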
### Budget Planning
Estimate your optimization cost before you start.

## Next Steps

Now that you’ve run your first optimization, explore more advanced use cases:

- **Use with DSPy**: Integrate GEPA with DSPy for powerful AI pipeline optimization
- **Optimize Anything**: Optimize code, configurations, and agent architectures
- **RAG Optimization**: Optimize retrieval-augmented generation pipelines
- **Custom Adapters**: Build adapters for your specific use case
## Using GEPA with DSPy (Recommended)
The most powerful way to use GEPA for AI pipelines is within DSPy, where it’s available as `dspy.GEPA`:
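A minimal sketch, assuming a recent DSPy version. `my_metric`, `trainset`, and `valset` are placeholders you would define; exact constructor arguments (and the metric signature) may differ across DSPy versions:

```python
import dspy

# Configure the task model being optimized (model id is illustrative).
dspy.configure(lm=dspy.LM("openai/gpt-4.1-mini"))

program = dspy.ChainOfThought("question -> answer")

optimizer = dspy.GEPA(
    metric=my_metric,                        # your scoring function (assumed defined)
    reflection_lm=dspy.LM("openai/gpt-4o"),  # stronger model for reflection
    auto="light",                            # preset evaluation budget
)

optimized = optimizer.compile(program, trainset=trainset, valset=valset)
```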
See DSPy GEPA tutorials for executable notebooks with real-world examples.
## Troubleshooting
### No improvement or low scores

Possible causes:

- Insufficient budget: increase `max_metric_calls` to 150-300
- Weak reflection model: use GPT-4o or o1 instead of smaller models
- Poor seed prompt: try a slightly better starting point
- Misaligned metric: ensure your evaluation metric rewards desired behavior

Fixes:

- Double your budget and try again
- Use the strongest reflection model you can afford
- Check that your metric correctly scores examples
### API rate limits or errors

Symptoms:

- Rate limit errors from your LLM provider
- Slow optimization runs

Fixes:

- Reduce `max_metric_calls` to fit your rate limits
- Use tier-appropriate limits (OpenAI Tier 3+ recommended)
- GEPA automatically retries with exponential backoff
- Consider using local models via Ollama for `task_lm`
### Poor generalization to validation

Symptoms:

- Training scores improve but validation scores don’t
- Overfitting to training examples

Fixes:

- Add more validation examples (10-20 minimum)
- Increase diversity in the training set
- GEPA’s Pareto frontier naturally regularizes, but ensure your data represents real usage
- Check for data leakage between train and validation sets
### Optimization runs too long

Causes:

- A high `max_metric_calls` setting
- A slow task model or evaluation function

Fixes:

- Start with `max_metric_calls=50` for initial experiments
- Use faster task models (e.g., gpt-4o-mini instead of gpt-4o)
- Reduce the training set to 20-30 examples
- Check the evaluation function for bottlenecks
## Learn More

- **GEPA Paper**: Research paper with detailed methodology and results
- **How It Works**: Deep dive into GEPA’s optimization algorithm
- **Use Cases**: Real-world applications across industries
- **API Reference**: Complete API documentation and configuration options
## Community & Support

- **Discord**: Join our Discord for real-time help and discussion
- **GitHub**: Star the repo, report issues, contribute adapters
- **Slack**: Connect with other GEPA users and contributors
- **Blog**: Latest updates, tutorials, and case studies