GEPA (Genetic-Pareto) optimizes any system with textual parameters against any evaluation metric. From prompt optimization to agent architecture discovery, GEPA has demonstrated significant improvements across production systems at companies like Databricks, Shopify, Dropbox, and OpenAI.

Why GEPA?

If you can measure it, you can optimize it: prompts, code, agent architectures, scheduling policies, vector graphics, and more. Unlike RL or gradient-based methods that collapse execution traces into a single scalar reward, GEPA uses LLMs to read full execution traces — error messages, profiling data, reasoning logs — to diagnose why a candidate failed and propose targeted fixes.
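The reflective loop described above can be sketched in miniature. Everything here is illustrative, not the gepa library's API: `run_candidate` and `reflect` are stand-ins for executing the real system and calling a reflection LLM on its traces, and the toy "missing units" failure is invented for the example.

```python
# Toy sketch of reflective prompt evolution: read textual traces from failed
# runs, then propose a targeted prompt edit. A real run would execute the
# actual system and have an LLM do the reflection step.

def run_candidate(prompt: str, example: dict) -> dict:
    """Toy task: fails with a readable trace when units are required but missing."""
    passed = ("units" in prompt) or (not example["needs_units"])
    trace = "" if passed else "error: answer omitted measurement units"
    return {"score": 1.0 if passed else 0.0, "trace": trace}

def reflect(prompt: str, traces: list[str]) -> str:
    """Stand-in for an LLM that reads full traces and proposes a targeted fix."""
    if any("units" in t for t in traces):
        return prompt + " Always state measurement units."
    return prompt

trainset = [{"needs_units": True}, {"needs_units": False}]
prompt = "Answer the question."
for _ in range(3):  # reflective evolution loop
    results = [run_candidate(prompt, ex) for ex in trainset]
    failures = [r["trace"] for r in results if r["trace"]]
    if not failures:
        break
    prompt = reflect(prompt, failures)  # traces in, targeted prompt edit out

print(prompt)  # -> "Answer the question. Always state measurement units."
```

The point of the sketch: the error message itself (not a scalar reward) is what drives the edit, which is why one failed rollout can teach the candidate a durable lesson.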

Key Results at a Glance

  • 90x cheaper: open-source models + GEPA beat Claude Opus 4.1 at Databricks
  • 35x faster than RL: 100–500 evaluations vs. 5,000–25,000+ for GRPO
  • 32% → 89%: ARC-AGI agent accuracy via architecture discovery
  • 40.2% cost savings: cloud scheduling policy discovered by GEPA, beating expert heuristics
  • 55% → 82%: coding agent resolve rate on Jinja via auto-learned skills
  • 50+ production uses: across Shopify, Databricks, Dropbox, OpenAI, Pydantic, MLflow, Comet ML

Use Case Categories

  • Prompt Optimization: improve LLM accuracy through reflective prompt evolution (46.6% → 56.6% on AIME with GPT-4.1 Mini)
  • Code Optimization: generate and optimize code, including CUDA kernels, scheduling policies, and system algorithms
  • Agent Architecture: discover agent designs through evolutionary search (nearly triple the accuracy on ARC-AGI)
  • RAG Optimization: tune retrieval pipelines, reranking strategies, and query generation for better context

When GEPA Shines

  • Costly evaluations: scientific simulations, complex agents with tool calls, slow compilation. GEPA needs 100–500 evals vs. 10K+ for RL.
  • Tiny datasets: works with as few as 3 examples; no large training sets required.
  • No weights access needed: optimize GPT-5, Claude, or Gemini directly through their APIs.
  • Interpretable: human-readable optimization traces show why each prompt changed.
  • Complementary to RL: use GEPA for rapid initial optimization, then apply RL/fine-tuning for additional gains.

Real-World Impact

“Both DSPy and (especially) GEPA are currently severely under hyped in the AI context engineering world”
– Tobi Lutke, CEO, Shopify

Production Deployments

GEPA is used in production at:
  • Databricks: 90x cost reduction with enterprise agents
  • Shopify: Optimizing AI context engineering
  • Dropbox: Production AI systems
  • OpenAI: Featured in official cookbook for self-evolving agents
  • Pydantic: Contact extraction improved from 86% → 97%
  • MLflow: Integrated as mlflow.genai.optimize_prompts()
  • Comet ML: Core algorithm in Opik Agent Optimizer

Research Impact

  • 35x faster than RL: 100–500 evaluations vs. 5,000–25,000+ for GRPO (paper)
  • State-of-the-art results: MATH benchmark 67% → 93% with DSPy Full Program optimization
  • Sample efficiency: Works with minimal data where RL requires thousands of examples

Getting Started

Ready to optimize your AI system? Choose your use case:

  • Quick Start: get up and running with GEPA in minutes
  • Adapters Guide: learn how to integrate GEPA with your system
  • API Reference: complete API documentation
  • Community: join our Discord community

Optimization Modes

GEPA supports three optimization paradigms:
  1. Single-Task Search: Solve one hard problem (e.g., circle packing, blackbox optimization)
  2. Multi-Task Search: Solve a batch of related problems with cross-transfer (e.g., CUDA kernels)
  3. Generalization: Build a skill that transfers to unseen problems (e.g., prompt optimization, agent architecture)
The same API works across all modes — learn more in each use case section.
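The "Pareto" half of Genetic-Pareto can also be sketched in a few lines. This is a simplified illustration, assuming per-task scores have already been collected; the candidate names and numbers are made up, and the library's actual selection involves more machinery (dominance filtering, stochastic sampling).

```python
# Toy sketch of Pareto-based candidate selection: keep every candidate that is
# best on at least one task, rather than only the single best average scorer.
# This preserves specialists whose strengths would vanish under averaging.

def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """scores maps candidate name -> per-task scores, tasks in a fixed order."""
    n_tasks = len(next(iter(scores.values())))
    frontier: set[str] = set()
    for t in range(n_tasks):
        best = max(s[t] for s in scores.values())
        frontier |= {name for name, s in scores.items() if s[t] == best}
    return frontier

scores = {
    "cand_a": [0.9, 0.2, 0.4],  # wins task 0
    "cand_b": [0.5, 0.8, 0.4],  # wins task 1, ties task 2
    "cand_c": [0.4, 0.3, 0.3],  # best on no task
}
print(sorted(pareto_frontier(scores)))  # -> ['cand_a', 'cand_b']
```

Note that `cand_a` has the same average score as `cand_b` only by coincidence; what keeps both alive is that each is the best somewhere, which is what lets lessons learned on one task cross-transfer to others.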
