Quick Start with GEPA

Optimize any text parameter — prompts, code, agent architectures — using LLM-based reflection and Pareto-efficient evolutionary search. This guide walks you through your first optimization in minutes.

What You’ll Build

In this quick start, you’ll optimize a system prompt for math problems from the AIME benchmark. With just a few lines of code, you’ll see performance jump from 46.6% → 56.6% accuracy on GPT-4.1 Mini.
1. Install GEPA

pip install gepa
GEPA works with any LLM provider through LiteLLM (OpenAI, Anthropic, local models, etc.). You’ll need an API key for the models you want to use.
Set your API key as an environment variable: export OPENAI_API_KEY=your_key_here
2. Run Your First Optimization

Here’s a complete working example that optimizes a math reasoning prompt:
import gepa

# Load the AIME math dataset (built-in example)
trainset, valset, _ = gepa.examples.aime.init_dataset()

# Start with a basic prompt
seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}

# Optimize the prompt
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",      # Model being optimized
    max_metric_calls=150,                # Number of evaluations
    reflection_lm="openai/gpt-5",       # Model that generates improvements
)

print("Optimized prompt:", result.best_candidate['system_prompt'])
Expected Result: GPT-4.1 Mini accuracy improves from 46.6% → 56.6% on AIME 2025 (a +10 percentage point gain).
To compare the original and optimized prompts and quantify the improvement:

# Review results
print("=" * 60)
print("OPTIMIZATION COMPLETE")
print("=" * 60)
print(f"\nOriginal prompt:")
print(f'  "{seed_prompt["system_prompt"]}"')
print(f"\nOptimized prompt:")
print(f'  "{result.best_candidate["system_prompt"]}"')
print(f"\nValidation score: {result.val_aggregate_scores[result.best_idx]:.1%}")
print(f"Improvement: +{(result.val_aggregate_scores[result.best_idx] - result.val_aggregate_scores[0]) * 100:.1f} points")
3. Understand What Happened

GEPA just ran an evolutionary optimization loop:
  1. Evaluate — The seed prompt is tested on training examples
  2. Reflect — An LLM analyzes failures and diagnoses why they occurred
  3. Mutate — New candidate prompts are generated based on the reflection
  4. Select — Better candidates are kept using Pareto-efficient search
  5. Repeat — The loop continues until the evaluation budget is exhausted (max_metric_calls=150)
Unlike RL methods that need 5,000-25,000+ evaluations, GEPA achieves strong results with just 100-500 evaluations by using full execution traces instead of scalar rewards.
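The five steps above can be sketched as a toy loop. This is illustrative only: `evaluate` and `reflect_and_mutate` are hypothetical stand-ins for GEPA's real metric and reflection components, not its API.

```python
import random

# Toy sketch of the evaluate -> reflect -> mutate -> select loop.
# Both helpers are hypothetical stand-ins, not GEPA's actual API.

def evaluate(candidate: str, examples: list[str]) -> float:
    """Toy metric: fraction of required phrases present in the prompt."""
    return sum(phrase in candidate for phrase in examples) / len(examples)

def reflect_and_mutate(candidate: str, examples: list[str]) -> str:
    """Stand-in for LLM reflection: patch in one missing phrase."""
    missing = [p for p in examples if p not in candidate]
    return candidate + " " + random.choice(missing) if missing else candidate

examples = ["show your work", "check units", "give a final answer"]
best = "Solve the problem."
best_score = evaluate(best, examples)

for _ in range(10):  # budget, analogous to max_metric_calls
    child = reflect_and_mutate(best, examples)  # reflect + mutate
    score = evaluate(child, examples)           # evaluate
    if score > best_score:                      # select: keep improvements
        best, best_score = child, score

print(f"Best score: {best_score:.2f}")  # Best score: 1.00
```

Real GEPA differs in two key ways: mutations come from LLM reflection over execution traces rather than random patching, and selection keeps a Pareto frontier of candidates rather than a single best.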
The result object contains:
# Best optimized prompt
best_prompt = result.best_candidate['system_prompt']

# All validation scores
val_scores = result.val_aggregate_scores

# Index of best candidate
best_idx = result.best_idx

# Total evaluations used
total_calls = result.total_metric_calls

print(f"Tried {len(val_scores)} candidates in {total_calls} evaluations")
print(f"Best validation score: {val_scores[best_idx]:.3f}")
4. Use Your Optimized Prompt

Deploy the optimized prompt in your application:
import litellm

def solve_math_problem(problem: str, optimized_prompt: dict) -> str:
    """Use the optimized prompt to solve math problems"""
    response = litellm.completion(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": optimized_prompt["system_prompt"]},
            {"role": "user", "content": problem}
        ]
    )
    return response.choices[0].message.content

# Use it on new problems
answer = solve_math_problem(
    "If x² + 2x - 3 = 0, what are the possible values of x?",
    result.best_candidate
)
print(answer)
Always validate your optimized prompts on held-out test data before production deployment to ensure they generalize well.

What You Can Optimize

GEPA isn’t limited to prompts. You can optimize any text parameter against any evaluation metric:

Prompts

System prompts, instructions, few-shot examples

Code

Functions, algorithms, configurations, policies

Agent Architectures

Entire agent systems, tool descriptions, workflows

Configurations

JSON configs, YAML files, scheduling policies

Real-World Results

Use Case          | Result                           | Source
Enterprise agents | 90x cheaper than Claude Opus 4.1 | Databricks
ARC-AGI agent     | 32% → 89% accuracy               | Blog
Cloud scheduling  | 40.2% cost savings               | Blog
Coding agent      | 55% → 82% resolve rate           | Blog
Math reasoning    | 67% → 93% on MATH                | DSPy Full Program Adapter
GEPA is in production at Shopify, Databricks, Dropbox, OpenAI, Pydantic, MLflow, Comet ML, and 50+ organizations.

Key Concepts

Understanding these concepts will help you use GEPA effectively.

Pareto-Efficient Candidate Selection

GEPA maintains a Pareto frontier of candidates. A candidate stays on the frontier if it is the best on at least one example, even if its average score is lower. This prevents the loss of specialized solutions.
# Example: Two candidates on the Pareto frontier
Candidate A: [100%, 100%, 0%, 0%]    # Avg: 50%
Candidate B: [60%, 60%, 60%, 60%]    # Avg: 60%

# Both are kept! A excels on examples 1-2, B excels on examples 3-4
Learn more about Pareto optimization →
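This retention rule can be sketched in a few lines (a minimal illustration, not GEPA's API): a candidate survives if it achieves the top score on at least one individual example.

```python
# Keep any candidate that is best on at least one individual example,
# using per-example scores rather than averages. Illustrative sketch only.

def pareto_frontier(candidates: dict[str, list[float]]) -> set[str]:
    n = len(next(iter(candidates.values())))
    best = [max(scores[i] for scores in candidates.values()) for i in range(n)]
    return {
        name for name, scores in candidates.items()
        if any(scores[i] == best[i] for i in range(n))
    }

scores = {
    "A": [1.0, 1.0, 0.0, 0.0],  # avg 50%, but best on examples 1-2
    "B": [0.6, 0.6, 0.6, 0.6],  # avg 60%, best on examples 3-4
    "C": [0.5, 0.5, 0.5, 0.5],  # dominated: best nowhere
}
print(sorted(pareto_frontier(scores)))  # ['A', 'B'], C is dropped
```

Averaging alone would have discarded candidate A, losing a solution that perfectly handles examples 1-2.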

Actionable Side Information (ASI)

Traditional optimizers only see pass/fail scores. GEPA reads full execution traces:
  • Error messages and stack traces
  • Model reasoning steps
  • Profiling data and timings
  • Any diagnostic information you log
This rich feedback is the text-optimization analogue of a gradient, enabling targeted improvements.
import gepa.optimize_anything as oa

def evaluate(candidate: str) -> float:
    result = run_system(candidate)
    oa.log(f"Output: {result.output}")     # Feeds into reflection
    oa.log(f"Error: {result.error}")       # LLM analyzes this
    oa.log(f"Timing: {result.time}ms")     # All diagnostic info
    return result.score
Learn more about ASI →

LLM-Based Reflection

Instead of random mutations, GEPA uses an LLM to:
  1. Read execution traces from failed examples
  2. Diagnose root causes of failures
  3. Propose targeted improvements
  4. Learn from accumulated lessons across all ancestors
This reflection process is why GEPA needs far fewer evaluations than gradient-free or RL methods. Learn more about reflective evolution →
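Conceptually, the reflection step amounts to packing execution traces into a meta-prompt for the reflection model. A hypothetical sketch (the trace fields and prompt wording are illustrative, not GEPA's internal format):

```python
# Hypothetical sketch of assembling failure traces for reflection.
# Field names and prompt wording are illustrative only.

def build_reflection_prompt(current_prompt: str, failures: list[dict]) -> str:
    traces = "\n\n".join(
        f"Input: {f['input']}\nOutput: {f['output']}\nFeedback: {f['feedback']}"
        for f in failures
    )
    return (
        "You are improving a system prompt.\n\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"Failed examples with execution traces:\n{traces}\n\n"
        "Diagnose the root causes and propose an improved prompt."
    )

failures = [{"input": "AIME #4", "output": "### 12", "feedback": "expected 204"}]
meta_prompt = build_reflection_prompt("Answer the question.", failures)
```

The richer the traces you log, the more material the reflection model has to diagnose failures, which is why Actionable Side Information matters.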

Configuration Tips

Choosing Models

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    # Task model: The model you're optimizing FOR
    # Use your production model or a cheaper alternative
    task_lm="openai/gpt-4o-mini",
    
    # Reflection model: The model that generates improvements
    # Use the smartest model you can afford
    reflection_lm="openai/gpt-4o",  # or "openai/o1" for best results
    
    max_metric_calls=100,
)
Recommendations:
  • Task model: Your production model or a cost-effective proxy
  • Reflection model: GPT-4o, Claude Opus, or o1 for best improvements
  • Budget: Start with 50-100 evaluations, increase to 150-300 for complex tasks

Data Requirements

A training set of 10-50 examples is usually sufficient. GEPA works with as few as 3 examples, but more data gives better results.
  • Simple tasks: 10-20 examples
  • Complex tasks: 30-50 examples
  • Ensure diversity: Cover edge cases and failure modes
20-30% of your total data should be held out for validation.
  • Minimum: 5-10 examples
  • Recommended: 10-20 examples
  • Must represent real-world usage patterns
Each example needs:
  • input: The input to your system
  • answer or output: Expected result
Optional but recommended:
  • reasoning: Expected reasoning steps (for complex tasks)
  • metadata: Any task-specific context
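Assuming a plain list-of-dicts dataset with the fields above (the exact format depends on your adapter; this is a sketch):

```python
# Sketch of a dataset split following the guidelines above.
# Field names (input/answer) mirror the requirements; adapt to your adapter.
examples = [
    {"input": "What is 7 * 8?", "answer": "56"},
    {"input": "Solve x + 3 = 10 for x.", "answer": "7"},
    {"input": "What is 15% of 200?", "answer": "30"},
    {"input": "Simplify 12/16.", "answer": "3/4"},
    {"input": "What is 2^10?", "answer": "1024"},
]

# Hold out ~20-30% for validation
split = int(len(examples) * 0.8)
trainset, valset = examples[:split], examples[split:]
print(len(trainset), len(valset))  # 4 1
```

Shuffle before splitting if your examples are ordered by difficulty or topic, so the validation set represents real usage.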

Budget Planning

Estimate your optimization cost:
# Rough cost estimate per optimization run (example prices; check current rates)
task_cost_per_eval = 0.001        # e.g., gpt-4o-mini
reflection_cost_per_call = 0.01   # e.g., gpt-4o

# Total cost = (evaluations × task cost) + (reflection calls × reflection cost)
# Reflection calls ≈ evaluations / 2 (GEPA proposes in batches)
evaluations = 150
estimated_cost = (evaluations * task_cost_per_eval
                  + evaluations / 2 * reflection_cost_per_call)
# ≈ $0.15 + $0.75 = $0.90 per optimization run
Monitor your API usage and costs during optimization. Set max_metric_calls based on your budget and API rate limits.

Next Steps

Now that you’ve run your first optimization, explore more advanced use cases:

Use with DSPy

Integrate GEPA with DSPy for powerful AI pipeline optimization

Optimize Anything

Optimize code, configurations, and agent architectures

RAG Optimization

Optimize retrieval-augmented generation pipelines

Custom Adapters

Build adapters for your specific use case
The most powerful way to use GEPA for AI pipelines is within DSPy, where it’s available as dspy.GEPA:
import dspy

# Define your DSPy program
class MyProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")
    
    def forward(self, question):
        return self.generate(question=question)

# Optimize with GEPA
optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

optimized_program = optimizer.compile(
    student=MyProgram(),
    trainset=trainset,
    valset=valset
)
See DSPy GEPA tutorials for executable notebooks with real-world examples.

Troubleshooting

Scores aren't improving

Possible causes:
  • Insufficient budget: Increase max_metric_calls to 150-300
  • Weak reflection model: Use GPT-4o or o1 instead of smaller models
  • Poor seed prompt: Try a slightly better starting point
  • Misaligned metric: Ensure your evaluation metric rewards desired behavior
Solutions:
  • Double your budget and try again
  • Use the strongest reflection model you can afford
  • Check that your metric correctly scores examples
Hitting API rate limits

Symptoms:
  • Rate limit errors from your LLM provider
  • Slow optimization runs
Solutions:
  • Reduce max_metric_calls to fit your rate limits
  • Use tier-appropriate limits (OpenAI Tier 3+ recommended)
  • GEPA automatically retries with exponential backoff
  • Consider using local models via Ollama for task_lm
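If you wrap your own evaluation code around external APIs, a generic exponential-backoff retry helps absorb transient rate-limit errors. This is a generic sketch, not GEPA's internal implementation:

```python
import random
import time

# Generic exponential backoff with jitter; not GEPA's internal code.
def with_backoff(fn, max_retries=5, base=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Wait base * 2^attempt seconds plus random jitter
            time.sleep(base * 2 ** attempt + random.random() * base)

# Example: a flaky call that succeeds on the third try
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, base=0.01))  # ok
```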
Poor generalization to validation data

Symptoms:
  • Training scores improve but validation scores don’t
  • Overfitting to training examples
Solutions:
  • Add more validation examples (10-20 minimum)
  • Increase diversity in training set
  • GEPA’s Pareto frontier naturally regularizes, but ensure your data represents real usage
  • Check for data leakage between train and validation sets
Optimization takes too long

Causes:
  • High max_metric_calls setting
  • Slow task model or evaluation function
Solutions:
  • Start with max_metric_calls=50 for initial experiments
  • Use faster task models (e.g., gpt-4o-mini instead of gpt-4o)
  • Reduce training set size to 20-30 examples
  • Check evaluation function for bottlenecks

Learn More

GEPA Paper

Research paper with detailed methodology and results

How It Works

Deep dive into GEPA’s optimization algorithm

Use Cases

Real-world applications across industries

API Reference

Complete API documentation and configuration options

Community & Support

Discord

Join our Discord for real-time help and discussion

GitHub

Star the repo, report issues, contribute adapters

Slack

Connect with other GEPA users and contributors

Blog

Latest updates, tutorials, and case studies
