
Simple Prompt Optimization Tutorial

Learn the fundamentals of prompt optimization with GEPA through a minimal, easy-to-understand example. This tutorial walks you through optimizing a system prompt in just a few lines of code.

Overview

GEPA (Genetic-Pareto) uses LLM-based reflection and evolutionary search to optimize text parameters. Unlike traditional methods that only see scalar scores, GEPA reads full execution traces to understand why candidates fail and propose targeted improvements.
Step 1: Install GEPA

pip install gepa
GEPA works with any LLM provider supported by LiteLLM (OpenAI, Anthropic, local models via Ollama, etc.).
Step 2: Prepare Your Data

Create training and validation datasets. Each example should have an input and expected output:
trainset = [
    {
        "input": "What is machine learning?",
        "answer": "Machine learning is a method of data analysis that automates "
                  "analytical model building..."
    },
    {
        "input": "Explain neural networks",
        "answer": "Neural networks are computing systems inspired by biological "
                  "neural networks..."
    },
    # Add more examples...
]

valset = [
    {
        "input": "What is deep learning?",
        "answer": "Deep learning is a subset of machine learning based on "
                  "artificial neural networks..."
    },
    # Add validation examples...
]
Best Practices:
  • Use 10-50 training examples for good results
  • Keep 20-30% of data for validation
  • Ensure examples cover diverse aspects of your task
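The 20-30% validation split can be done with a small helper. This is an illustrative sketch, not part of the GEPA API; the function name and signature are made up for this tutorial:

```python
import random

def train_val_split(examples, val_fraction=0.25, seed=0):
    """Shuffle and split examples, holding out val_fraction for validation."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

examples = [{"input": f"q{i}", "answer": f"a{i}"} for i in range(20)]
trainset, valset = train_val_split(examples)
print(len(trainset), len(valset))  # 15 5
```

Shuffling before splitting matters: if your examples are grouped by topic, an unshuffled split would give you a validation set that covers only the last topics.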
Step 3: Define the Seed Prompt

Start with a basic prompt as your baseline:
seed_prompt = {
    "system_prompt": "You are a helpful AI assistant. Answer questions clearly and accurately."
}
GEPA will evolve this into a more effective, task-specific prompt.
Step 4: Run GEPA Optimization

Optimize your prompt with a single function call:
import gepa

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4o-mini",      # Model being optimized
    max_metric_calls=50,                 # Total evaluation budget (metric calls)
    reflection_lm="openai/gpt-4o",      # Model generating improvements
)

print("Optimized prompt:")
print(result.best_candidate['system_prompt'])
print(f"\nValidation score: {result.val_aggregate_scores[result.best_idx]:.3f}")
What’s happening:
  1. GEPA evaluates the seed prompt on training examples
  2. An LLM reflects on failures and proposes improvements
  3. Better prompts are selected using Pareto-efficient search
  4. The process repeats until the max_metric_calls evaluation budget is exhausted
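The loop above can be sketched in plain Python. The stub functions below stand in for the real LLM calls, and greedy acceptance stands in for GEPA's Pareto selection; this illustrates the control flow only, not GEPA's actual implementation:

```python
# Stub "LLM" pieces: the metric rewards prompts that mention each example's
# topic keyword, and "reflection" appends the worst-scoring example's topic.
def evaluate(candidate, examples):
    return [1.0 if ex["topic"] in candidate["system_prompt"] else 0.0
            for ex in examples]

def reflect_and_propose(candidate, examples, scores):
    worst = examples[min(range(len(scores)), key=scores.__getitem__)]
    child = dict(candidate)
    child["system_prompt"] += " " + worst["topic"]
    return child

def optimize(seed, examples, budget):
    best, best_scores = seed, evaluate(seed, examples)
    calls = len(examples)                    # one metric call per example evaluated
    while calls + len(examples) <= budget:
        child = reflect_and_propose(best, examples, best_scores)
        child_scores = evaluate(child, examples)
        calls += len(examples)
        if sum(child_scores) >= sum(best_scores):  # greedy stand-in for Pareto selection
            best, best_scores = child, child_scores
    return best, best_scores

examples = [{"input": "q1", "topic": "math"}, {"input": "q2", "topic": "history"}]
seed = {"system_prompt": "You are a helpful assistant."}
best, scores = optimize(seed, examples, budget=8)
print(best["system_prompt"], scores)
```

Even in this toy version, the budget is counted in metric calls rather than loop iterations, which mirrors how max_metric_calls works.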
Step 5: Understand the Output

GEPA returns a GEPAResult object containing:
# Best optimized prompt
best_prompt = result.best_candidate

# Validation scores for all candidates
val_scores = result.val_aggregate_scores

# Index of best candidate
best_idx = result.best_idx

# Total optimization iterations
total_calls = result.total_metric_calls

print(f"Spent {total_calls} metric calls")
print(f"Best validation score: {val_scores[best_idx]:.3f}")
print(f"Improvement: {val_scores[best_idx] - val_scores[0]:.3f}")
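Since val_aggregate_scores holds one aggregate score per candidate, a quick loop shows how scores evolved across the run (the scores below are made up for illustration):

```python
# Hypothetical score list, shaped like result.val_aggregate_scores
val_scores = [0.42, 0.55, 0.51, 0.63]

for i, score in enumerate(val_scores):
    marker = " <-- best" if score == max(val_scores) else ""
    print(f"candidate {i}: {score:.3f}{marker}")
```

Non-monotonic progressions like candidate 2 above are normal: GEPA keeps exploring rather than only accepting strict improvements.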
Step 6: Use the Optimized Prompt

Deploy your optimized prompt in production:
import litellm

def answer_question(question: str, optimized_prompt: dict) -> str:
    """Use the optimized prompt to answer questions"""
    response = litellm.completion(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": optimized_prompt["system_prompt"]},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# Use it
answer = answer_question(
    "What is reinforcement learning?",
    result.best_candidate
)
print(answer)

Complete Example

Here’s a full working example:
import gepa

# 1. Prepare data
trainset = [
    {"input": "What is AI?", "answer": "Artificial Intelligence..."},
    {"input": "What is ML?", "answer": "Machine Learning..."},
    {"input": "What is DL?", "answer": "Deep Learning..."},
]

valset = [
    {"input": "What is NLP?", "answer": "Natural Language Processing..."},
]

# 2. Define seed prompt
seed_prompt = {
    "system_prompt": "You are a helpful AI assistant."
}

# 3. Optimize
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4o-mini",
    max_metric_calls=30,
    reflection_lm="openai/gpt-4o",
)

# 4. Review results
print("Original:", seed_prompt["system_prompt"])
print("\nOptimized:", result.best_candidate["system_prompt"])
scores = result.val_aggregate_scores
print(f"\nScore improvement: {scores[result.best_idx] - scores[0]:+.3f}")

Key Concepts

Pareto-Efficient Search

GEPA maintains a frontier of candidates, keeping any that excel on specific examples—even if their average score is lower.
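A minimal sketch of that selection rule (illustrative only, not GEPA's internal code): a candidate survives if it achieves the best score on at least one example, even when its average is not the highest.

```python
def pareto_frontier(score_table):
    """score_table[i][j] = score of candidate i on example j.
    Keep every candidate that is best on at least one example."""
    n_examples = len(score_table[0])
    best_per_example = [max(row[j] for row in score_table) for j in range(n_examples)]
    return [i for i, row in enumerate(score_table)
            if any(row[j] == best_per_example[j] for j in range(n_examples))]

# Candidate 0 is a specialist that wins on example 0, candidate 2 is the
# strong generalist; candidate 1 is best nowhere and is dropped.
scores = [
    [0.9, 0.1, 0.2],  # candidate 0
    [0.4, 0.4, 0.4],  # candidate 1
    [0.5, 0.8, 0.7],  # candidate 2
]
print(pareto_frontier(scores))  # [0, 2]
```

Keeping the specialist matters: its prompt may contain an insight the reflection LLM can later merge into a generalist candidate.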

Actionable Side Information

Unlike methods that only see pass/fail scores, GEPA reads error messages, reasoning traces, and execution details.

LLM-Based Reflection

A reflection LLM analyzes failures, diagnoses root causes, and proposes targeted improvements—not random mutations.

Few Evaluations

Achieves strong results with 50-150 evaluations vs. 5,000-25,000+ for reinforcement learning methods.

Configuration Options

Model Selection

# Use different models for task and reflection
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4o-mini",       # Cheaper model being optimized
    reflection_lm="openai/o1",           # Smarter model for reflection
    max_metric_calls=100,
)

Custom Metrics

def custom_metric(prediction, ground_truth, trace=None):
    """Define how to score predictions"""
    # Exact match
    if prediction.strip().lower() == ground_truth.strip().lower():
        return 1.0

    # Partial credit for keyword overlap
    pred_words = set(prediction.lower().split())
    true_words = set(ground_truth.lower().split())
    if not true_words:
        return 0.0  # guard against division by zero on empty ground truth
    overlap = len(pred_words & true_words) / len(true_words)

    return overlap

# Use custom metric
from gepa.adapters.default_adapter import DefaultAdapter

adapter = DefaultAdapter(
    task_lm="openai/gpt-4o-mini",
    metric=custom_metric
)

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    adapter=adapter,
    reflection_lm="openai/gpt-4o",
    max_metric_calls=50,
)
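It is worth sanity-checking a metric like this on a few hand-written cases before spending an optimization budget on it. The metric logic is restated below (without error handling) so the snippet runs on its own:

```python
def custom_metric(prediction, ground_truth, trace=None):
    # Exact match earns full credit
    if prediction.strip().lower() == ground_truth.strip().lower():
        return 1.0
    # Otherwise, fraction of ground-truth words that appear in the prediction
    pred_words = set(prediction.lower().split())
    true_words = set(ground_truth.lower().split())
    return len(pred_words & true_words) / len(true_words)

print(custom_metric("Paris", "paris"))                                     # 1.0
print(custom_metric("machine learning rocks", "machine learning is fun"))  # 0.5
print(custom_metric("unrelated", "machine learning"))                      # 0.0
```

If the scores do not match your intuition about answer quality, fix the metric first; GEPA optimizes whatever the metric rewards.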

Local Models with Ollama

# Use local models via Ollama (requires ollama running locally)
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="ollama/llama3.1:8b",
    reflection_lm="ollama/llama3.1:70b",
    max_metric_calls=50,
)

Troubleshooting

Low or inconsistent scores?
  • Increase budget: Try max_metric_calls=100 or higher
  • Better reflection model: Use GPT-4o or o1 for reflection
  • More diverse examples: Ensure trainset covers edge cases
  • Check metric: Verify your evaluation metric is meaningful

Rate limits and costs?
  • Rate limits: GEPA respects provider rate limits automatically
  • Use tier-appropriate limits: Set max_metric_calls based on your API tier
  • Monitor costs: Each metric call uses the task_lm once

Overfitting concerns?
  • More validation data: Use at least 5-10 validation examples
  • Regularization: GEPA’s Pareto frontier naturally prevents overfitting
  • Data quality: Ensure validation set represents real usage

Next Steps

Math Optimization

Optimize prompts for complex mathematical reasoning tasks

RAG Pipeline

Optimize entire RAG systems with multiple vector stores

Agent Architecture

Evolve complete agent systems beyond just prompts

API Reference

Explore all configuration options and advanced features
