Overview

Comet ML Opik is an open-source platform for evaluating, testing, and monitoring LLM applications. GEPA serves as the core optimization algorithm in Opik’s Agent Optimizer, enabling automated improvement of AI agents through reflective evolution.

Why GEPA in Opik?

Opik integrates GEPA as its primary optimization algorithm for agent development:
  • Built-in optimization: GEPA is natively integrated into Opik’s Agent Optimizer
  • Evaluation-driven: Leverage Opik’s evaluation framework with GEPA’s optimization
  • Production monitoring: Track optimization progress alongside production metrics
  • Open source: Both Opik and GEPA are open-source and free to use

Setup

Install Opik:
pip install opik
Configure Opik (optional for local use):
opik configure
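
If you prefer not to run the interactive command, Opik can also be configured through environment variables (a minimal sketch; the values are placeholders for your own credentials):
import os

# Placeholder credentials; only needed when logging to the hosted Opik platform
os.environ["OPIK_API_KEY"] = "your-api-key"
os.environ["OPIK_WORKSPACE"] = "your-workspace"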

Using GEPA in Opik

Opik provides the GEPAOptimizer class as part of its agent optimization toolkit:
from opik.agent_optimizer import GEPAOptimizer
import opik

# Initialize Opik client
client = opik.Opik()

# Define your evaluation task; x combines each dataset item with the
# candidate prompt under evaluation (exposed under the "prompt" key)
def evaluation_task(x):
    prompt = x['prompt']
    question = x['question']

    # run_agent is your own agent implementation (see the sketch below)
    response = run_agent(prompt, question)
    return {"response": response}

# Define scoring metric: x is the dataset item, y is the task output
def accuracy_metric(x, y):
    expected = x['expected_answer']
    actual = y['response']
    return 1.0 if expected.lower() in actual.lower() else 0.0

# Create dataset
dataset = client.create_dataset(
    name="my-optimization-dataset",
    description="Dataset for agent optimization"
)

# Add examples to dataset
dataset.insert([
    {
        "question": "What is the capital of France?",
        "expected_answer": "Paris"
    },
    {
        "question": "What is 15 + 27?",
        "expected_answer": "42"
    },
])

# Initialize GEPA optimizer
optimizer = GEPAOptimizer(
    reflection_lm="openai/gpt-4o",
    max_iterations=50,
)

# Run optimization
result = optimizer.optimize(
    seed_prompt="You are a helpful assistant.",
    dataset=dataset,
    evaluation_task=evaluation_task,
    scoring_metrics=[accuracy_metric],
)

print("Optimized prompt:", result.best_prompt)
print("Best score:", result.best_score)

Advanced Configuration

Custom Evaluation Metrics

Opik supports multiple evaluation metrics simultaneously:
from opik.evaluation.metrics import Metric

# Define custom metrics; compute_relevance and compute_completeness
# are placeholders for your own scoring logic
class RelevanceMetric(Metric):
    def score(self, x, y):
        return compute_relevance(x, y)

class CompletenessMetric(Metric):
    def score(self, x, y):
        return compute_completeness(x, y)

# Use multiple metrics
result = optimizer.optimize(
    seed_prompt="You are a helpful assistant.",
    dataset=dataset,
    evaluation_task=evaluation_task,
    scoring_metrics=[RelevanceMetric(), CompletenessMetric()],
)
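
If you would rather optimize against a single scalar objective, one option is to fold several metrics into one weighted score yourself (a sketch against the Metric interface shown above; the weights are illustrative):
class CombinedMetric(Metric):
    def __init__(self, metrics, weights):
        self.metrics = metrics
        self.weights = weights

    def score(self, x, y):
        # Weighted average of the individual metric scores
        total = sum(w * m.score(x, y) for m, w in zip(self.metrics, self.weights))
        return total / sum(self.weights)

combined = CombinedMetric(
    metrics=[RelevanceMetric(), CompletenessMetric()],
    weights=[0.7, 0.3],
)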

Multi-Component Optimization

Optimize multiple prompt components in your agent:
# Define multi-component seed
seed_candidate = {
    "system_prompt": "You are a helpful assistant.",
    "task_instruction": "Answer the user's question clearly and concisely.",
    "output_format": "Provide your answer in a single sentence.",
}

# Evaluation task reading multiple components; the optimizer supplies the
# candidate's components alongside the dataset fields in x
def multi_component_task(x):
    response = run_agent(
        system_prompt=x['system_prompt'],
        instruction=x['task_instruction'],
        format_guide=x['output_format'],
        question=x['question'],
    )
    return {"response": response}

# Optimize all components
result = optimizer.optimize(
    seed_prompt=seed_candidate,
    dataset=dataset,
    evaluation_task=multi_component_task,
    scoring_metrics=[accuracy_metric],
)

print("Optimized system prompt:", result.best_candidate['system_prompt'])
print("Optimized instruction:", result.best_candidate['task_instruction'])
print("Optimized format:", result.best_candidate['output_format'])

Tracking and Monitoring

Opik automatically tracks optimization progress:
# Access optimization history
for iteration in result.history:
    print(f"Iteration {iteration.number}:")
    print(f"  Score: {iteration.score}")
    print(f"  Candidate: {iteration.candidate}")

# View in Opik dashboard
print(f"View results: {result.opik_url}")

Integration with Opik Evaluation

Combine GEPA optimization with Opik’s evaluation framework:
from opik.evaluation import evaluate
from opik.agent_optimizer import GEPAOptimizer

# Create evaluation experiment
experiment = evaluate(
    experiment_name="agent-optimization",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[accuracy_metric],
)

# Run optimization on top of evaluation
optimizer = GEPAOptimizer(
    reflection_lm="openai/gpt-4o",
    max_iterations=100,
)

result = optimizer.optimize_from_experiment(
    experiment=experiment,
    seed_prompt="You are a helpful assistant.",
)

Best Practices

  1. Start with quality datasets: Use diverse, representative examples in your Opik dataset
  2. Define clear metrics: Create specific, measurable scoring functions
  3. Monitor costs: Track API usage during optimization in the Opik dashboard
  4. Iterate incrementally: Start with small iteration counts, then scale up
  5. Use validation sets: Separate training and validation data in your datasets

Opik Platform Features

When using GEPA through Opik, you get access to:

  • Visual Dashboard: track optimization progress with real-time visualizations
  • Experiment Comparison: compare multiple optimization runs side-by-side
  • Dataset Management: organize and version your evaluation datasets
  • Collaboration: share optimization results with your team

Example: E-commerce Agent

Complete example optimizing an e-commerce support agent:
from opik.agent_optimizer import GEPAOptimizer
import opik

# Initialize client
client = opik.Opik()

# Create dataset with support queries
dataset = client.create_dataset(name="ecommerce-support")
dataset.insert([
    {
        "query": "Where is my order #12345?",
        "expected_action": "lookup_order",
        "expected_tone": "helpful",
    },
    {
        "query": "I want to return this item",
        "expected_action": "initiate_return",
        "expected_tone": "accommodating",
    },
    {
        "query": "Your product broke after one day!",
        "expected_action": "escalate_complaint",
        "expected_tone": "empathetic",
    },
])

# Define evaluation task
def support_agent_task(x):
    prompt = x['prompt']
    query = x['query']
    
    # run_support_agent is your own agent implementation; extract_action and
    # analyze_tone are placeholder helpers (sketched after this example)
    response = run_support_agent(prompt, query)
    
    return {
        "response": response,
        "action": extract_action(response),
        "tone": analyze_tone(response),
    }

# Define scoring metrics
def action_accuracy(x, y):
    return 1.0 if x['expected_action'] == y['action'] else 0.0

def tone_quality(x, y):
    return 1.0 if x['expected_tone'] == y['tone'] else 0.0

# Run optimization
optimizer = GEPAOptimizer(
    reflection_lm="openai/gpt-4o",
    max_iterations=75,
)

result = optimizer.optimize(
    seed_prompt="You are a customer support agent. Help customers with their inquiries.",
    dataset=dataset,
    evaluation_task=support_agent_task,
    scoring_metrics=[action_accuracy, tone_quality],
)

print("Optimized support agent prompt:")
print(result.best_prompt)
print(f"\nFinal accuracy: {result.best_score:.2%}")
print(f"View full results: {result.opik_url}")

External Resources

  • Opik GEPA Optimizer Documentation: official Comet ML Opik documentation for GEPA optimization
  • Opik Getting Started: learn more about the Comet ML Opik platform
  • Opik Python SDK: complete Python SDK reference
