GEPA’s core strength is optimizing prompts through LLM-based reflection rather than gradient descent or reinforcement learning. By reading full execution traces and reasoning about failures, GEPA evolves highly effective prompts with minimal evaluations.

Key Results

AIME 2025

46.6% → 56.6% accuracy with GPT-4.1 Mini (+10 percentage points)

HotpotQA

Multi-hop retrieval optimization with detailed query generation strategies

Enterprise Tasks

3-7% performance gains across all model types at Databricks

Sample Efficient

Works with as few as 3 examples — no large training sets required

Simple Prompt Optimization

Optimize a system prompt for math problems from the AIME benchmark:
import gepa

trainset, valset, _ = gepa.examples.aime.init_dataset()

seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

print("Optimized prompt:", result.best_candidate['system_prompt'])
Result: GPT-4.1 Mini improves from 46.6% to 56.6% on AIME 2025 (+10 percentage points).

The most powerful way to use GEPA for prompt optimization is within DSPy, where it’s available as dspy.GEPA:
import dspy

optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)
optimized_program = optimizer.compile(
    student=MyProgram(),
    trainset=trainset,
    valset=valset,
)

Real-World Results with DSPy

  • MATH benchmark: 67% → 93% with DSPy Full Program optimization
  • Structured extraction: 20+ percentage point improvement in exact match accuracy
  • Contact extraction: 86% → 97% accuracy with Pydantic AI integration

Example: AIME Math Optimization

From the AIME 2025 optimization case study:
from gepa.optimize_anything import (
    optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig,
)

result = optimize_anything(
    seed_candidate={"system_prompt": "You are a helpful math assistant."},
    evaluator=evaluate_math_problem,
    dataset=trainset,
    valset=valset,
    objective="Optimize system prompt for AIME math competition problems.",
    config=GEPAConfig(
        engine=EngineConfig(max_metric_calls=350),
        reflection=ReflectionConfig(reflection_lm="openai/gpt-5"),
    ),
)

Optimization Trajectory

Starting from a simple prompt, GEPA evolves detailed problem-solving strategies:
  • Initial: “You are a helpful assistant. Answer the question.”
  • After 150 calls: Detailed instructions for base conversions, palindromes, symmetric sums, intersecting families, and more
  • Final accuracy: 46.67% → 60.00% on AIME 2025
See the full AIME prompt example below for the detailed instructions GEPA discovers.

How It Works

GEPA’s prompt optimization follows this cycle:
  1. Select a candidate from the Pareto frontier
  2. Execute on a minibatch, capturing full execution traces
  3. Reflect — LLM reads traces and diagnoses failures
  4. Mutate — Generate improved candidate informed by lessons from ancestors
  5. Accept — Add to pool if improved, update Pareto front
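The cycle above can be sketched as a simple loop. This is an illustrative simulation, not the library's actual internals: `evaluate` and `reflect_and_mutate` are stubs standing in for running the task LM and for the reflection LM's rewrite, and best-mean selection stands in for Pareto-frontier sampling.

```python
import random

random.seed(0)

def evaluate(candidate, batch):
    # Stub: returns one score per example. A real evaluator runs the
    # task LM and captures the full execution trace for reflection.
    return [random.random() for _ in batch]

def reflect_and_mutate(candidate):
    # Stub: a real implementation asks the reflection LM to read the
    # traces, diagnose failures, and rewrite the prompt.
    return candidate + " +fix"

batch = list(range(4))  # minibatch of training examples
pool = {"seed prompt": evaluate("seed prompt", batch)}

for step in range(10):
    # 1. Select a strong candidate (best mean score here stands in
    #    for sampling from the Pareto frontier).
    parent = max(pool, key=lambda c: sum(pool[c]))
    # 2-4. Execute on the minibatch, reflect on traces, mutate.
    child = reflect_and_mutate(parent)
    child_scores = evaluate(child, batch)
    # 5. Accept the child only if it improves on at least one example,
    #    preserving per-example (Pareto-style) winners in the pool.
    if any(c > p for c, p in zip(child_scores, pool[parent])):
        pool[child] = child_scores

best = max(pool, key=lambda c: sum(pool[c]))
```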

Actionable Side Information (ASI)

Unlike gradient-based methods, GEPA uses Actionable Side Information:
def evaluate(candidate: dict, example: dict) -> tuple[float, dict]:
    # execute_task and classify_error are user-supplied helpers.
    result = execute_task(candidate, example)
    # Return a scalar score plus diagnostic side information for reflection.
    return result.score, {
        "reasoning": result.chain_of_thought,
        "correct_answer": example["answer"],
        "model_answer": result.answer,
        "error_type": classify_error(result, example),
    }
The LLM proposer reads this diagnostic feedback to understand why failures occur and propose targeted fixes.

Evolved Prompts

GEPA discovers detailed, domain-specific prompts. Here are examples:

AIME Prompt (Excerpt)

You will be given one math problem as plain text. Your job is to solve it 
correctly and return:

- reasoning: a concise, logically ordered solution that uses identities/structure 
  to avoid brute force, ends with a quick verification.
- answer: the final requested number/expression only (no extra words).

Domain-specific strategies:

1) Base-conversion/digit rearrangement:
- Translate positional notation correctly: in base b, (a b c)_b = a·b^2 + b·b + c
- Enforce digit ranges strictly (e.g., in base 9, digits ∈ {0,…,8})
- Use modular constraints to prune:
  • Mod 9 often collapses coefficients
  • Mod 8: 99 ≡ 3, 71 ≡ 7 ⇒ 3a ≡ 7b (mod 8)

2) Palindromes across bases:
- Bound the base length by magnitude (e.g., n < 1000 ⇒ octal has 3–4 digits)
- Characterize palindromes:
  • 3-digit octal: (A B A)_8 = 65A + 8B
  • 4-digit octal: (A B B A)_8 = 513A + 72B (with A ≥ 1)

[... continues with 6 more problem categories ...]
See the full evolved prompt in the README.md:186-340.

HotpotQA Multi-Hop Retrieval Prompt (Excerpt)

Your task is to generate a new search query optimized for the **second hop** 
of a multi-hop retrieval system.

Key Observations:
- First-hop documents often cover one entity or aspect in the question
- Remaining relevant documents often involve connected or higher-level concepts 
  mentioned in summary_1 but not explicitly asked in the original question
- The query should target these *missing*, but logically linked, documents

Practical Strategy:
- Read the summary carefully to spot references to bigger contexts or other 
  entities not covered in the first hop
- Ask yourself: "What entity or aspect does this summary hint at that could 
  answer the original question but was not found yet?"
- Formulate a precise, focused factual query targeting that entity

[...]
See the full prompt in the README.md:202-247.

Production Use Cases

Databricks: Enterprise Agents

90x cost reduction while maintaining or improving performance by optimizing enterprise agents with GEPA.
  • Open-source models + GEPA outperform Claude Opus 4.1, Claude Sonnet 4, and GPT-5
  • Consistent 3-7% performance gains across all model types
  • At 100,000 requests, serving costs represent 95%+ of AI expenditure
Read the Databricks blog →

Pydantic AI: Contact Extraction

Contact extraction improved from 86% → 97% accuracy using GEPA with Pydantic AI.
from pydantic_ai import Agent

# optimized_prompt comes from a prior GEPA run, e.g.
# result.best_candidate['system_prompt']
agent = Agent(
    model='openai:gpt-4',
    system_prompt=optimized_prompt,
)
Read the tutorial →

HuggingFace: Structured Extraction

20+ percentage point improvement in exact match accuracy for structured extraction tasks. View the cookbook →

Advantages Over RL

35x Faster

100–500 evaluations vs. 5,000–25,000+ for GRPO

Interpretable

Human-readable traces show why each prompt changed

Sample Efficient

Works with as few as 3 examples

API-Only Models

No weights access needed — works with GPT-5, Claude, Gemini

Comparison with RL Methods

From the GEPA paper:
  • GRPO (Group Relative Policy Optimization): Requires 5,000–25,000+ evaluations
  • GEPA: Achieves comparable or better results with 100–500 evaluations
  • Key insight: Reading full traces is more informative than scalar rewards

Integration Examples

MLflow Integration

import mlflow.genai

optimized_prompts = mlflow.genai.optimize_prompts(
    prompt_template="Your initial prompt",
    training_data=train_data,
    optimizer="gepa",
    max_iterations=150,
)
MLflow documentation →

Comet ML Opik

GEPA is the core optimization algorithm in Opik Agent Optimizer:
from opik.optimizers import GEPAOptimizer

optimizer = GEPAOptimizer(
    metric=your_metric,
    max_calls=150
)
optimized_agent = optimizer.optimize(agent, dataset)
Opik documentation →

Best Practices

  • Begin with a minimal prompt like “You are a helpful assistant.” GEPA will evolve the complexity.
  • Return structured feedback from your evaluator to help GEPA understand failure modes.
  • When optimizing multiple aspects (accuracy, brevity, tone), GEPA’s Pareto frontier preserves candidates that excel at different objectives.
  • Always provide a valset so the optimized prompt generalizes to unseen examples.
  • For multi-step AI pipelines, use DSPy with GEPA to optimize entire programs, not just individual prompts.
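As an illustration of structured feedback combined with multiple quality signals, here is a sketch of an evaluator that scores exact-match accuracy while reporting brevity in the side information. `run_model` is a hypothetical stand-in for invoking the task LM, stubbed here so the sketch runs end to end:

```python
def evaluate(candidate: dict, example: dict) -> tuple[float, dict]:
    # run_model is a hypothetical stand-in for calling the task LM.
    answer = run_model(candidate["system_prompt"], example["question"])
    score = float(answer.strip() == example["answer"])
    return score, {
        "model_answer": answer,
        "expected": example["answer"],
        "word_count": len(answer.split()),  # brevity signal for reflection
    }

def run_model(prompt: str, question: str) -> str:
    return "4"  # stub response so the example is self-contained

score, info = evaluate(
    {"system_prompt": "Be brief."},
    {"question": "What is 2 + 2?", "answer": "4"},
)
```

The reflection LM sees the side-information dict alongside the score, so signals like `word_count` can steer the evolved prompt toward brevity without changing the scalar objective.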

Next Steps

Quick Start

Get started with GEPA in 5 minutes

Code Optimization

Learn about optimizing code with GEPA

DSPy Tutorials

Step-by-step DSPy + GEPA tutorials

API Reference

Complete API documentation
