GEPA’s core strength is optimizing prompts through LLM-based reflection rather than gradient descent or reinforcement learning. By reading full execution traces and reasoning about failures, GEPA evolves highly effective prompts with minimal evaluations.

Key Results

AIME 2025

46.6% → 56.6% accuracy with GPT-4.1 Mini (+10 percentage points)

HotpotQA

Multi-hop retrieval optimization with detailed query generation strategies

Enterprise Tasks

3-7% performance gains across all model types at Databricks

Sample Efficient

Works with as few as 3 examples — no large training sets required

Simple Prompt Optimization

Optimize a system prompt for math problems from the AIME benchmark:
import gepa

trainset, valset, _ = gepa.examples.aime.init_dataset()

seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

print("Optimized prompt:", result.best_candidate['system_prompt'])
Result: GPT-4.1 Mini improves from 46.6% to 56.6% on AIME 2025 (+10 percentage points).

The most powerful way to use GEPA for prompt optimization is within DSPy, where it’s available as dspy.GEPA:
import dspy

optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)
optimized_program = optimizer.compile(
    student=MyProgram(),
    trainset=trainset,
    valset=valset,
)

Real-World Results with DSPy

  • MATH benchmark: 67% → 93% with DSPy Full Program optimization
  • Structured extraction: 20+ percentage point improvement in exact match accuracy
  • Contact extraction: 86% → 97% accuracy with Pydantic AI integration

Example: AIME Math Optimization

From the AIME 2025 optimization case study:
from gepa.optimize_anything import (
    optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig,
)

result = optimize_anything(
    seed_candidate={"system_prompt": "You are a helpful math assistant."},
    evaluator=evaluate_math_problem,
    dataset=trainset,
    valset=valset,
    objective="Optimize system prompt for AIME math competition problems.",
    config=GEPAConfig(
        engine=EngineConfig(max_metric_calls=350),
        reflection=ReflectionConfig(reflection_lm="openai/gpt-5"),
    ),
)

Optimization Trajectory

Starting from a simple prompt, GEPA evolves detailed problem-solving strategies:
  • Initial: “You are a helpful assistant. Answer the question.”
  • After 150 calls: Detailed instructions for base conversions, palindromes, symmetric sums, intersecting families, and more
  • Final accuracy: 46.67% → 60.00% on AIME 2025
See the full AIME prompt example below for the detailed instructions GEPA discovers.

How It Works

GEPA’s prompt optimization follows this cycle:
  1. Select a candidate from the Pareto frontier
  2. Execute on a minibatch, capturing full execution traces
  3. Reflect — LLM reads traces and diagnoses failures
  4. Mutate — Generate improved candidate informed by lessons from ancestors
  5. Accept — Add to pool if improved, update Pareto front
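The cycle above can be sketched as a simple loop. This is an illustrative simulation, not the library's actual internals: `evaluate` and `reflect_and_mutate` are stubs standing in for running the task LM and for the reflection LM's rewrite, and best-mean selection stands in for Pareto-frontier sampling.

```python
import random

random.seed(0)

def evaluate(candidate, batch):
    # Stub: returns one score per example. A real evaluator runs the
    # task LM and captures the full execution trace for reflection.
    return [random.random() for _ in batch]

def reflect_and_mutate(candidate):
    # Stub: a real implementation asks the reflection LM to read the
    # traces, diagnose failures, and rewrite the prompt.
    return candidate + " +fix"

batch = list(range(4))  # minibatch of training examples
pool = {"seed prompt": evaluate("seed prompt", batch)}

for step in range(10):
    # 1. Select a strong candidate (best mean score here stands in
    #    for sampling from the Pareto frontier).
    parent = max(pool, key=lambda c: sum(pool[c]))
    # 2-4. Execute on the minibatch, reflect on traces, mutate.
    child = reflect_and_mutate(parent)
    child_scores = evaluate(child, batch)
    # 5. Accept the child only if it improves on at least one example,
    #    preserving per-example (Pareto-style) winners in the pool.
    if any(c > p for c, p in zip(child_scores, pool[parent])):
        pool[child] = child_scores

best = max(pool, key=lambda c: sum(pool[c]))
```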

Actionable Side Information (ASI)

Unlike gradient-based methods, GEPA uses Actionable Side Information:
def evaluate(candidate: dict, example: dict) -> tuple[float, dict]:
    # execute_task and classify_error are user-supplied helpers.
    result = execute_task(candidate, example)
    # Return a scalar score plus diagnostic side information for reflection.
    return result.score, {
        "reasoning": result.chain_of_thought,
        "correct_answer": example["answer"],
        "model_answer": result.answer,
        "error_type": classify_error(result, example),
    }
The LLM proposer reads this diagnostic feedback to understand why failures occur and propose targeted fixes.

Evolved Prompts

GEPA discovers detailed, domain-specific prompts. Here are examples:

AIME Prompt (Excerpt)

You will be given one math problem as plain text. Your job is to solve it 
correctly and return:

- reasoning: a concise, logically ordered solution that uses identities/structure 
  to avoid brute force, ends with a quick verification.
- answer: the final requested number/expression only (no extra words).

Domain-specific strategies:

1) Base-conversion/digit rearrangement:
- Translate positional notation correctly: in base b, (a b c)_b = a·b^2 + b·b + c
- Enforce digit ranges strictly (e.g., in base 9, digits ∈ {0,…,8})
- Use modular constraints to prune:
  • Mod 9 often collapses coefficients
  • Mod 8: 99 ≡ 3, 71 ≡ 7 ⇒ 3a ≡ 7b (mod 8)

2) Palindromes across bases:
- Bound the base length by magnitude (e.g., n < 1000 ⇒ octal has 3–4 digits)
- Characterize palindromes:
  • 3-digit octal: (A B A)_8 = 65A + 8B
  • 4-digit octal: (A B B A)_8 = 513A + 72B (with A ≥ 1)

[... continues with 6 more problem categories ...]
See the full evolved prompt in the README.md:186-340.

HotpotQA Multi-Hop Retrieval Prompt (Excerpt)

Your task is to generate a new search query optimized for the **second hop** 
of a multi-hop retrieval system.

Key Observations:
- First-hop documents often cover one entity or aspect in the question
- Remaining relevant documents often involve connected or higher-level concepts 
  mentioned in summary_1 but not explicitly asked in the original question
- The query should target these *missing*, but logically linked, documents

Practical Strategy:
- Read the summary carefully to spot references to bigger contexts or other 
  entities not covered in the first hop
- Ask yourself: "What entity or aspect does this summary hint at that could 
  answer the original question but was not found yet?"
- Formulate a precise, focused factual query targeting that entity

[...]
See the full prompt in the README.md:202-247.

Production Use Cases

Databricks: Enterprise Agents

90x cost reduction while maintaining or improving performance by optimizing enterprise agents with GEPA.
  • Open-source models + GEPA outperform Claude Opus 4.1, Claude Sonnet 4, and GPT-5
  • Consistent 3-7% performance gains across all model types
  • At 100,000 requests, serving costs represent 95%+ of AI expenditure
Read the Databricks blog →

Pydantic AI: Contact Extraction

Contact extraction improved from 86% → 97% accuracy using GEPA with Pydantic AI.
from pydantic_ai import Agent

# optimized_prompt comes from a prior GEPA run, e.g.
# result.best_candidate['system_prompt']
agent = Agent(
    model='openai:gpt-4',
    system_prompt=optimized_prompt,
)
Read the tutorial →

HuggingFace: Structured Extraction

20+ percentage point improvement in exact match accuracy for structured extraction tasks. View the cookbook →

Advantages Over RL

35x Faster

100–500 evaluations vs. 5,000–25,000+ for GRPO

Interpretable

Human-readable traces show why each prompt changed

Sample Efficient

Works with as few as 3 examples

API-Only Models

No weights access needed — works with GPT-5, Claude, Gemini

Comparison with RL Methods

From the GEPA paper:
  • GRPO (Group Relative Policy Optimization): Requires 5,000–25,000+ evaluations
  • GEPA: Achieves comparable or better results with 100–500 evaluations
  • Key insight: Reading full traces is more informative than scalar rewards

Integration Examples

MLflow Integration

import mlflow.genai

optimized_prompts = mlflow.genai.optimize_prompts(
    prompt_template="Your initial prompt",
    training_data=train_data,
    optimizer="gepa",
    max_iterations=150,
)
MLflow documentation →

Comet ML Opik

GEPA is the core optimization algorithm in Opik Agent Optimizer:
from opik.optimizers import GEPAOptimizer

optimizer = GEPAOptimizer(
    metric=your_metric,
    max_calls=150
)
optimized_agent = optimizer.optimize(agent, dataset)
Opik documentation →

Best Practices

  • Begin with a minimal prompt like “You are a helpful assistant.” GEPA will evolve the complexity.
  • Return structured feedback from your evaluator to help GEPA understand failure modes.
  • When optimizing multiple aspects (accuracy, brevity, tone), GEPA’s Pareto frontier preserves candidates that excel at different objectives.
  • Always provide a valset so the optimized prompt generalizes to unseen examples.
  • For multi-step AI pipelines, use DSPy with GEPA to optimize entire programs, not just individual prompts.
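As an illustration of structured feedback combined with multiple quality signals, here is a sketch of an evaluator that scores exact-match accuracy while reporting brevity in the side information. `run_model` is a hypothetical stand-in for invoking the task LM, stubbed here so the sketch runs end to end:

```python
def evaluate(candidate: dict, example: dict) -> tuple[float, dict]:
    # run_model is a hypothetical stand-in for calling the task LM.
    answer = run_model(candidate["system_prompt"], example["question"])
    score = float(answer.strip() == example["answer"])
    return score, {
        "model_answer": answer,
        "expected": example["answer"],
        "word_count": len(answer.split()),  # brevity signal for reflection
    }

def run_model(prompt: str, question: str) -> str:
    return "4"  # stub response so the example is self-contained

score, info = evaluate(
    {"system_prompt": "Be brief."},
    {"question": "What is 2 + 2?", "answer": "4"},
)
```

The reflection LM sees the side-information dict alongside the score, so signals like `word_count` can steer the evolved prompt toward brevity without changing the scalar objective.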

Next Steps

Quick Start

Get started with GEPA in 5 minutes

Code Optimization

Learn about optimizing code with GEPA

DSPy Tutorials

Step-by-step DSPy + GEPA tutorials

API Reference

Complete API documentation
