Quick Start with GEPA

Optimize any text parameter — prompts, code, agent architectures — using LLM-based reflection and Pareto-efficient evolutionary search. This guide walks you through your first optimization in minutes.

What You’ll Build

In this quick start, you’ll optimize a system prompt for math problems from the AIME benchmark. With just a few lines of code, you’ll see performance jump from 46.6% → 56.6% accuracy on GPT-4.1 Mini.
1. Install GEPA

pip install gepa
GEPA works with any LLM provider through LiteLLM (OpenAI, Anthropic, local models, etc.). You’ll need an API key for the models you want to use.
Set your API key as an environment variable: export OPENAI_API_KEY=your_key_here
2. Run Your First Optimization

Here’s a complete working example that optimizes a math reasoning prompt:
import gepa

# Load the AIME math dataset (built-in example)
trainset, valset, _ = gepa.examples.aime.init_dataset()

# Start with a basic prompt
seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}

# Optimize the prompt
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",      # Model being optimized
    max_metric_calls=150,                # Number of evaluations
    reflection_lm="openai/gpt-5",       # Model that generates improvements
)

print("Optimized prompt:", result.best_candidate['system_prompt'])
Expected Result: GPT-4.1 Mini accuracy improves from 46.6% → 56.6% on AIME 2025 (a +10 percentage point gain).
To compare the original and optimized prompts and quantify the improvement:

# Review results
print("=" * 60)
print("OPTIMIZATION COMPLETE")
print("=" * 60)
print(f"\nOriginal prompt:")
print(f'  "{seed_prompt["system_prompt"]}"')
print(f"\nOptimized prompt:")
print(f'  "{result.best_candidate["system_prompt"]}"')
print(f"\nValidation score: {result.val_aggregate_scores[result.best_idx]:.1%}")
print(f"Improvement: +{(result.val_aggregate_scores[result.best_idx] - result.val_aggregate_scores[0]) * 100:.1f} points")
3. Understand What Happened

GEPA just ran an evolutionary optimization loop:
  1. Evaluate — The seed prompt is tested on training examples
  2. Reflect — An LLM analyzes failures and diagnoses why they occurred
  3. Mutate — New candidate prompts are generated based on the reflection
  4. Select — Better candidates are kept using Pareto-efficient search
  5. Repeat — The loop continues until the evaluation budget is exhausted (max_metric_calls=150)
Unlike RL methods that need 5,000-25,000+ evaluations, GEPA achieves strong results with just 100-500 evaluations by using full execution traces instead of scalar rewards.
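The five steps above can be sketched as a toy loop. This is illustrative only: `evaluate` and `reflect_and_mutate` are hypothetical stand-ins for GEPA's real metric and reflection components, not its API.

```python
import random

# Toy sketch of the evaluate -> reflect -> mutate -> select loop.
# Both helpers are hypothetical stand-ins, not GEPA's actual API.

def evaluate(candidate: str, examples: list[str]) -> float:
    """Toy metric: fraction of required phrases present in the prompt."""
    return sum(phrase in candidate for phrase in examples) / len(examples)

def reflect_and_mutate(candidate: str, examples: list[str]) -> str:
    """Stand-in for LLM reflection: patch in one missing phrase."""
    missing = [p for p in examples if p not in candidate]
    return candidate + " " + random.choice(missing) if missing else candidate

examples = ["show your work", "check units", "give a final answer"]
best = "Solve the problem."
best_score = evaluate(best, examples)

for _ in range(10):  # budget, analogous to max_metric_calls
    child = reflect_and_mutate(best, examples)  # reflect + mutate
    score = evaluate(child, examples)           # evaluate
    if score > best_score:                      # select: keep improvements
        best, best_score = child, score

print(f"Best score: {best_score:.2f}")  # Best score: 1.00
```

Real GEPA differs in two key ways: mutations come from LLM reflection over execution traces rather than random patching, and selection keeps a Pareto frontier of candidates rather than a single best.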
The result object contains:
# Best optimized prompt
best_prompt = result.best_candidate['system_prompt']

# All validation scores
val_scores = result.val_aggregate_scores

# Index of best candidate
best_idx = result.best_idx

# Total evaluations used
total_calls = result.total_metric_calls

print(f"Tried {len(val_scores)} candidates in {total_calls} evaluations")
print(f"Best validation score: {val_scores[best_idx]:.3f}")
4. Use Your Optimized Prompt

Deploy the optimized prompt in your application:
import litellm

def solve_math_problem(problem: str, optimized_prompt: dict) -> str:
    """Use the optimized prompt to solve math problems"""
    response = litellm.completion(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": optimized_prompt["system_prompt"]},
            {"role": "user", "content": problem}
        ]
    )
    return response.choices[0].message.content

# Use it on new problems
answer = solve_math_problem(
    "If x² + 2x - 3 = 0, what are the possible values of x?",
    result.best_candidate
)
print(answer)
Always validate your optimized prompts on held-out test data before production deployment to ensure they generalize well.

What You Can Optimize

GEPA isn’t limited to prompts. You can optimize any text parameter against any evaluation metric:

Prompts

System prompts, instructions, few-shot examples

Code

Functions, algorithms, configurations, policies

Agent Architectures

Entire agent systems, tool descriptions, workflows

Configurations

JSON configs, YAML files, scheduling policies

Real-World Results

Use Case          | Result                           | Source
Enterprise agents | 90x cheaper than Claude Opus 4.1 | Databricks
ARC-AGI agent     | 32% → 89% accuracy               | Blog
Cloud scheduling  | 40.2% cost savings               | Blog
Coding agent      | 55% → 82% resolve rate           | Blog
Math reasoning    | 67% → 93% on MATH                | DSPy Full Program Adapter
GEPA is in production at Shopify, Databricks, Dropbox, OpenAI, Pydantic, MLflow, Comet ML, and 50+ organizations.

Key Concepts

Understanding these concepts will help you use GEPA effectively.

Pareto-Efficient Candidate Selection

GEPA maintains a Pareto frontier of candidates. A candidate stays on the frontier if it is the best on at least one example, even if its average score is lower. This prevents the loss of specialized solutions.
# Example: Two candidates on the Pareto frontier
Candidate A: [100%, 100%, 0%, 0%]    # Avg: 50%
Candidate B: [60%, 60%, 60%, 60%]    # Avg: 60%

# Both are kept! A excels on examples 1-2, B excels on examples 3-4
Learn more about Pareto optimization →
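This retention rule can be sketched in a few lines (a minimal illustration, not GEPA's API): a candidate survives if it achieves the top score on at least one individual example.

```python
# Keep any candidate that is best on at least one individual example,
# using per-example scores rather than averages. Illustrative sketch only.

def pareto_frontier(candidates: dict[str, list[float]]) -> set[str]:
    n = len(next(iter(candidates.values())))
    best = [max(scores[i] for scores in candidates.values()) for i in range(n)]
    return {
        name for name, scores in candidates.items()
        if any(scores[i] == best[i] for i in range(n))
    }

scores = {
    "A": [1.0, 1.0, 0.0, 0.0],  # avg 50%, but best on examples 1-2
    "B": [0.6, 0.6, 0.6, 0.6],  # avg 60%, best on examples 3-4
    "C": [0.5, 0.5, 0.5, 0.5],  # dominated: best nowhere
}
print(sorted(pareto_frontier(scores)))  # ['A', 'B'], C is dropped
```

Averaging alone would have discarded candidate A, losing a solution that perfectly handles examples 1-2.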

Actionable Side Information (ASI)

Traditional optimizers only see pass/fail scores. GEPA reads full execution traces:
  • Error messages and stack traces
  • Model reasoning steps
  • Profiling data and timings
  • Any diagnostic information you log
This rich feedback is the text-optimization analogue of a gradient, enabling targeted improvements.
import gepa.optimize_anything as oa

def evaluate(candidate: str) -> float:
    result = run_system(candidate)
    oa.log(f"Output: {result.output}")     # Feeds into reflection
    oa.log(f"Error: {result.error}")       # LLM analyzes this
    oa.log(f"Timing: {result.time}ms")     # All diagnostic info
    return result.score
Learn more about ASI →

LLM-Based Reflection

Instead of random mutations, GEPA uses an LLM to:
  1. Read execution traces from failed examples
  2. Diagnose root causes of failures
  3. Propose targeted improvements
  4. Learn from accumulated lessons across all ancestors
This reflection process is why GEPA needs far fewer evaluations than gradient-free or RL methods. Learn more about reflective evolution →
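Conceptually, the reflection step amounts to packing execution traces into a meta-prompt for the reflection model. A hypothetical sketch (the trace fields and prompt wording are illustrative, not GEPA's internal format):

```python
# Hypothetical sketch of assembling failure traces for reflection.
# Field names and prompt wording are illustrative only.

def build_reflection_prompt(current_prompt: str, failures: list[dict]) -> str:
    traces = "\n\n".join(
        f"Input: {f['input']}\nOutput: {f['output']}\nFeedback: {f['feedback']}"
        for f in failures
    )
    return (
        "You are improving a system prompt.\n\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"Failed examples with execution traces:\n{traces}\n\n"
        "Diagnose the root causes and propose an improved prompt."
    )

failures = [{"input": "AIME #4", "output": "### 12", "feedback": "expected 204"}]
meta_prompt = build_reflection_prompt("Answer the question.", failures)
```

The richer the traces you log, the more material the reflection model has to diagnose failures, which is why Actionable Side Information matters.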

Configuration Tips

Choosing Models

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    # Task model: The model you're optimizing FOR
    # Use your production model or a cheaper alternative
    task_lm="openai/gpt-4o-mini",
    
    # Reflection model: The model that generates improvements
    # Use the smartest model you can afford
    reflection_lm="openai/gpt-4o",  # or "openai/o1" for best results
    
    max_metric_calls=100,
)
Recommendations:
  • Task model: Your production model or a cost-effective proxy
  • Reflection model: GPT-4o, Claude Opus, or o1 for best improvements
  • Budget: Start with 50-100 evaluations, increase to 150-300 for complex tasks

Data Requirements

A training set of 10-50 examples is usually sufficient. GEPA works with as few as 3 examples, but more data gives better results.
  • Simple tasks: 10-20 examples
  • Complex tasks: 30-50 examples
  • Ensure diversity: Cover edge cases and failure modes
20-30% of your total data should be held out for validation.
  • Minimum: 5-10 examples
  • Recommended: 10-20 examples
  • Must represent real-world usage patterns
Each example needs:
  • input: The input to your system
  • answer or output: Expected result
Optional but recommended:
  • reasoning: Expected reasoning steps (for complex tasks)
  • metadata: Any task-specific context
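Assuming a plain list-of-dicts dataset with the fields above (the exact format depends on your adapter; this is a sketch):

```python
# Sketch of a dataset split following the guidelines above.
# Field names (input/answer) mirror the requirements; adapt to your adapter.
examples = [
    {"input": "What is 7 * 8?", "answer": "56"},
    {"input": "Solve x + 3 = 10 for x.", "answer": "7"},
    {"input": "What is 15% of 200?", "answer": "30"},
    {"input": "Simplify 12/16.", "answer": "3/4"},
    {"input": "What is 2^10?", "answer": "1024"},
]

# Hold out ~20-30% for validation
split = int(len(examples) * 0.8)
trainset, valset = examples[:split], examples[split:]
print(len(trainset), len(valset))  # 4 1
```

Shuffle before splitting if your examples are ordered by difficulty or topic, so the validation set represents real usage.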

Budget Planning

Estimate your optimization cost:
# Rough cost estimate per optimization run (example prices; check current rates)
task_cost_per_eval = 0.001        # e.g., gpt-4o-mini
reflection_cost_per_call = 0.01   # e.g., gpt-4o

# Total cost = (evaluations × task cost) + (reflection calls × reflection cost)
# Reflection calls ≈ evaluations / 2 (GEPA proposes in batches)
evaluations = 150
estimated_cost = (evaluations * task_cost_per_eval
                  + evaluations / 2 * reflection_cost_per_call)
# ≈ $0.15 + $0.75 = $0.90 per optimization run
Monitor your API usage and costs during optimization. Set max_metric_calls based on your budget and API rate limits.

Next Steps

Now that you’ve run your first optimization, explore more advanced use cases:

Use with DSPy

Integrate GEPA with DSPy for powerful AI pipeline optimization

Optimize Anything

Optimize code, configurations, and agent architectures

RAG Optimization

Optimize retrieval-augmented generation pipelines

Custom Adapters

Build adapters for your specific use case
The most powerful way to use GEPA for AI pipelines is within DSPy, where it’s available as dspy.GEPA:
import dspy

# Define your DSPy program
class MyProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")
    
    def forward(self, question):
        return self.generate(question=question)

# Optimize with GEPA
optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

optimized_program = optimizer.compile(
    student=MyProgram(),
    trainset=trainset,
    valset=valset
)
See DSPy GEPA tutorials for executable notebooks with real-world examples.

Troubleshooting

Scores aren't improving

Possible causes:
  • Insufficient budget: Increase max_metric_calls to 150-300
  • Weak reflection model: Use GPT-4o or o1 instead of smaller models
  • Poor seed prompt: Try a slightly better starting point
  • Misaligned metric: Ensure your evaluation metric rewards desired behavior
Solutions:
  • Double your budget and try again
  • Use the strongest reflection model you can afford
  • Check that your metric correctly scores examples
Hitting API rate limits

Symptoms:
  • Rate limit errors from your LLM provider
  • Slow optimization runs
Solutions:
  • Reduce max_metric_calls to fit your rate limits
  • Use tier-appropriate limits (OpenAI Tier 3+ recommended)
  • GEPA automatically retries with exponential backoff
  • Consider using local models via Ollama for task_lm
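If you wrap your own evaluation code around external APIs, a generic exponential-backoff retry helps absorb transient rate-limit errors. This is a generic sketch, not GEPA's internal implementation:

```python
import random
import time

# Generic exponential backoff with jitter; not GEPA's internal code.
def with_backoff(fn, max_retries=5, base=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Wait base * 2^attempt seconds plus random jitter
            time.sleep(base * 2 ** attempt + random.random() * base)

# Example: a flaky call that succeeds on the third try
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, base=0.01))  # ok
```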
Poor generalization to validation data

Symptoms:
  • Training scores improve but validation scores don’t
  • Overfitting to training examples
Solutions:
  • Add more validation examples (10-20 minimum)
  • Increase diversity in training set
  • GEPA’s Pareto frontier naturally regularizes, but ensure your data represents real usage
  • Check for data leakage between train and validation sets
Optimization takes too long

Causes:
  • High max_metric_calls setting
  • Slow task model or evaluation function
Solutions:
  • Start with max_metric_calls=50 for initial experiments
  • Use faster task models (e.g., gpt-4o-mini instead of gpt-4o)
  • Reduce training set size to 20-30 examples
  • Check evaluation function for bottlenecks

Learn More

GEPA Paper

Research paper with detailed methodology and results

How It Works

Deep dive into GEPA’s optimization algorithm

Use Cases

Real-world applications across industries

API Reference

Complete API documentation and configuration options

Community & Support

Discord

Join our Discord for real-time help and discussion

GitHub

Star the repo, report issues, contribute adapters

Slack

Connect with other GEPA users and contributors

Blog

Latest updates, tutorials, and case studies
