GEPA’s optimize_anything API extends beyond prompts to optimize any text artifact, including code. From CUDA kernels to cloud scheduling policies, GEPA can evolve code by reading execution results, compiler errors, and profiling data.

Key Results

CUDA Kernels

87% of kernels match or beat baseline, 25% are 20%+ faster

Cloud Scheduling

40.2% cost savings on multi-cloud data transfer (CloudCast)

Coding Agents

55% → 82% resolve rate with auto-learned skills

Circle Packing

Outperforms AlphaEvolve’s solution with a score of 2.636+

The optimize_anything API

Basic Pattern

At its core, optimizing code requires just an artifact and an evaluator:
import gepa.optimize_anything as oa

def evaluate(candidate: str) -> float:
    result = execute_code(candidate)
    oa.log(f"Error: {result.stderr}")
    oa.log(f"Runtime: {result.time_ms}ms")
    return result.score

result = oa.optimize_anything(
    seed_candidate="<your initial code>",
    evaluator=evaluate,
)

print(result.best_candidate)

With Structured Feedback

For richer diagnostics, return a score together with a feedback dictionary:
def evaluate(candidate: str) -> tuple[float, dict]:
    result = execute_code(candidate)
    return result.score, {
        "Error": result.stderr,
        "Output": result.stdout,
        "Runtime": f"{result.time_ms:.1f}ms",
        "Memory": f"{result.peak_memory_mb:.1f}MB",
    }

Use Case 1: CUDA Kernel Generation

Mode: Multi-Task Search

Generate fast CUDA kernels for multiple PyTorch operations from KernelBench.

Results

  • 87% of generated kernels match or beat baseline performance
  • 48% are 10%+ faster than baseline
  • 25% are 20%+ faster than baseline
  • Multi-task mode outperforms single-task optimization through cross-transfer learning
import gepa.optimize_anything as oa

def evaluate_kernel(candidate: str, example: dict) -> tuple[float, dict]:
    """Compile and benchmark CUDA kernel."""
    compile_result = compile_cuda(candidate, example["reference_code"])
    
    if not compile_result.success:
        return 0.0, {
            "CompilationError": compile_result.stderr,
            "LineNumber": compile_result.error_line,
        }
    
    benchmark_result = benchmark_kernel(compile_result.binary, example["input"])
    
    speedup = benchmark_result.baseline_time / benchmark_result.kernel_time
    
    return speedup, {
        "BaselineTime": f"{benchmark_result.baseline_time:.2f}ms",
        "KernelTime": f"{benchmark_result.kernel_time:.2f}ms",
        "Speedup": f"{speedup:.2f}x",
        "ProfilerOutput": benchmark_result.profiler_trace,
    }

# Load KernelBench tasks
kernelbench_tasks = load_kernelbench_problems()

result = oa.optimize_anything(
    seed_candidate=cuda_template,
    evaluator=evaluate_kernel,
    dataset=kernelbench_tasks,  # Multi-task mode
    objective="Generate optimized CUDA kernels for PyTorch operations.",
    config=oa.GEPAConfig(
        engine=oa.EngineConfig(max_metric_calls=500),
        reflection=oa.ReflectionConfig(reflection_lm="openai/gpt-5"),
    ),
)

Multi-Task vs Single-Task

On the 10 problems where multi-task mode performed best, we re-optimized each from scratch in single-task mode:
Multi-task mode consistently outperforms dedicated single-task optimization across all speedup thresholds (1.0x, 1.1x, 1.2x).
Why it works: Optimization patterns discovered for one kernel (e.g., memory coalescing, warp-level operations) transfer to other kernels. The Pareto frontier preserves specialized strategies that excel on different operations.
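This per-task selection can be sketched in a few lines (a simplified illustration of the idea, not GEPA's internal implementation): a candidate survives as long as it is the best on at least one task.

```python
def pareto_frontier(per_task_scores: dict[str, list[float]]) -> set[str]:
    """Keep every candidate that achieves the best score on at least one task."""
    n_tasks = len(next(iter(per_task_scores.values())))
    frontier = set()
    for task in range(n_tasks):
        best = max(scores[task] for scores in per_task_scores.values())
        frontier.update(
            name for name, scores in per_task_scores.items() if scores[task] == best
        )
    return frontier

# Each specialist wins one kernel, so both survive; the generalist,
# best at nothing, is dropped. (Candidate names here are made up.)
candidates = {
    "coalesced_loads": [1.4, 0.9],  # speedups on kernels A and B
    "warp_shuffle":    [0.8, 1.5],
    "generalist":      [1.1, 1.1],
}
print(pareto_frontier(candidates))
```

Keeping per-task winners rather than a single global best is what lets a memory-coalescing trick survive long enough to be recombined into kernels where it was not initially the top strategy.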

Use Case 2: Cloud Scheduling Policies

Mode: Generalization

CloudCast: Multi-Cloud Data Transfer

Discover broadcast routing strategies for multi-cloud data transfer that minimize egress cost. Result: 40.2% cost savings on test set, outperforming expert heuristics and other LLM evolution frameworks.
def evaluate_cloudcast(candidate: str, example: dict) -> tuple[float, dict]:
    """Simulate data transfer and measure cost."""
    transfer_graph = build_transfer_graph(example)
    exec_globals = {}
    exec(candidate, exec_globals)
    
    routing_fn = exec_globals["route_transfer"]
    routes = routing_fn(transfer_graph, example["source"], example["destinations"])
    
    total_cost = simulate_transfer(routes, example["pricing"])
    baseline_cost = example["baseline_cost"]
    
    savings_pct = (baseline_cost - total_cost) / baseline_cost * 100
    
    return savings_pct, {
        "Cost": f"${total_cost:.2f}",
        "BaselineCost": f"${baseline_cost:.2f}",
        "Savings": f"{savings_pct:.1f}%",
        "RouteTopology": describe_topology(routes),
    }

train_scenarios, test_scenarios = load_cloudcast_dataset()

result = oa.optimize_anything(
    seed_candidate=baseline_dijkstra_routing,
    evaluator=evaluate_cloudcast,
    dataset=train_scenarios,
    valset=test_scenarios,
    objective="Minimize multi-cloud data transfer cost.",
    background="""
    The candidate is a Python function that takes a transfer graph and returns 
    a routing strategy. Consider:
    - Provider-specific egress pricing
    - Network topology constraints
    - Steiner tree vs shortest-path tradeoffs
    """,
)
Evolution: From baseline Dijkstra routing → provider-aware Steiner tree algorithm with Pareto-frontier candidate selection.
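For reference, a seed like baseline_dijkstra_routing might define the route_transfer contract the evaluator expects along these lines (the adjacency-dict graph format here is an assumption for illustration, not CloudCast's actual schema):

```python
import heapq

def route_transfer(graph, source, destinations):
    """Baseline: independent cheapest-path (Dijkstra) routing to each destination.
    Assumes graph is {node: {neighbor: egress_cost_per_gb}} (illustrative format)."""
    routes = {}
    for dest in destinations:
        dist = {source: 0.0}
        prev = {}
        heap = [(0.0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if node == dest:
                break  # cheapest path to this destination found
            if d > dist.get(node, float("inf")):
                continue  # stale heap entry
            for nbr, cost in graph.get(node, {}).items():
                nd = d + cost
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    prev[nbr] = node
                    heapq.heappush(heap, (nd, nbr))
        # Walk the predecessor map back from the destination.
        path, node = [dest], dest
        while node != source:
            node = prev[node]
            path.append(node)
        routes[dest] = path[::-1]
    return routes

# aws -> azure directly ($0.087/GB) beats routing through gcp ($0.12 + $0.08)
print(route_transfer(
    {"aws": {"azure": 0.087, "gcp": 0.12}, "gcp": {"azure": 0.08}},
    "aws", ["azure"],
))  # {'azure': ['aws', 'azure']}
```

Routing each destination independently is exactly the weakness GEPA exploits: a Steiner-tree strategy can share path prefixes across destinations and pay egress once.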

Can’t Be Late: Spot Instance Scheduling

Learn scheduling policies that decide when to use cheap SPOT instances vs reliable ON_DEMAND instances. Result: 7.8% cost savings while meeting all deadlines.
def evaluate_scheduling(candidate: str, example: dict) -> tuple[float, dict]:
    """Simulate task scheduling and measure cost."""
    exec_globals = {}
    exec(candidate, exec_globals)
    
    schedule_fn = exec_globals["schedule_tasks"]
    schedule = schedule_fn(example["tasks"], example["spot_availability"])
    
    result = simulate_execution(schedule, example)
    
    if result.deadline_missed:
        return 0.0, {"Error": "Deadline missed"}
    
    savings_pct = (result.baseline_cost - result.cost) / result.baseline_cost * 100
    
    return savings_pct, {
        "Cost": f"${result.cost:.2f}",
        "SpotUsage": f"{result.spot_pct:.1f}%",
        "DeadlineMargin": f"{result.deadline_margin}s",
    }
Evolution: From simple deadline-check heuristic → adaptive strategy tracking spot availability patterns and break-even switching costs.
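The "simple deadline-check heuristic" starting point might look like this (a hypothetical sketch; the task fields, the availability threshold, and the 3x spot slowdown factor are illustrative assumptions):

```python
def schedule_tasks(tasks, spot_availability):
    """Deadline-check heuristic: use SPOT only when even the worst case
    still makes the deadline. Field names here are assumptions."""
    SPOT_SLOWDOWN = 3.0  # pessimistic allowance for preemptions and restarts
    schedule = []
    for task in tasks:
        worst_case_s = task["duration_s"] * SPOT_SLOWDOWN
        use_spot = worst_case_s < task["deadline_s"] and spot_availability > 0.5
        schedule.append({
            "id": task["id"],
            "instance": "SPOT" if use_spot else "ON_DEMAND",
        })
    return schedule
```

A fixed slowdown factor leaves savings on the table; the evolved policy replaces it with observed availability patterns and a break-even calculation for switching instance types mid-task.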

Use Case 3: Coding Agent Skills

Mode: Generalization

Optimize natural-language instructions (skills) for coding agents working with specific codebases.

Results

  • Bleve (Go search library): 24% → 93% pass rate
  • Jinja (Python template engine): 55% → 82% resolve rate
  • Transfer to Claude Code: Near-perfect accuracy with 47% faster resolution
import gepa.optimize_anything as oa

def evaluate_coding_skill(candidate: dict, example: dict) -> tuple[float, dict]:
    """Run coding agent with skill on a repository task."""
    agent = create_coding_agent(
        model="anthropic/claude-sonnet-4.5",
        skills=candidate["skills"]
    )
    
    result = agent.solve_task(
        repo=example["repo_path"],
        task=example["task_description"],
        timeout=300
    )
    
    tests_passed = run_tests(example["repo_path"])
    
    return float(tests_passed), {
        "TestsPassed": tests_passed,
        "AgentLog": result.execution_log,
        "Error": result.error if not tests_passed else None,
        "TimeSpent": f"{result.duration}s",
    }

# Load repository-specific tasks
bleve_train, bleve_test = load_bleve_tasks()

result = oa.optimize_anything(
    seed_candidate={"skills": "Follow best practices."},
    evaluator=evaluate_coding_skill,
    dataset=bleve_train,
    valset=bleve_test,
    objective="Learn repository-specific skills for the Bleve Go codebase.",
    background="""
    Bleve is a Go full-text search library. Common patterns:
    - Heavy use of interfaces
    - Custom index structures
    - Memory-mapped files
    """,
)

Learned Skills Example

GEPA discovers specific guidance like:
  • “When modifying index structures, always update both the in-memory and disk representations”
  • “Bleve uses custom serialization — check existing encode/decode methods before adding new ones”
  • “Tests require specific fixture files in testdata/ — ensure paths are relative to package root”
Read the full blog post →

Use Case 4: Circle Packing

Mode: Single-Task Search

Pack n=26 circles to maximize the sum of their radii within a unit square. Result: Score of 2.636+, outperforming AlphaEvolve, ShinkaEvolve, and OpenEvolve.
def evaluate_circle_packing(candidate: str) -> tuple[float, dict]:
    """Execute packing algorithm and return sum of radii."""
    exec_globals = {"n": 26}
    exec(candidate, exec_globals)
    
    packing_fn = exec_globals["pack_circles"]
    circles = packing_fn(n=26)
    
    # Validate constraints
    violations = check_constraints(circles)
    if violations:
        return 0.0, {"Error": violations}
    
    total_radius = sum(c.radius for c in circles)
    
    return total_radius, {
        "TotalRadius": f"{total_radius:.5f}",
        "NumCircles": len(circles),
        "Packing": visualize_packing(circles),  # ASCII art or image
        "MinSpacing": f"{compute_min_spacing(circles):.5f}",
    }

result = oa.optimize_anything(
    seed_candidate=naive_packing_code,
    evaluator=evaluate_circle_packing,
    objective="Maximize sum of circle radii while fitting in unit square.",
)
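A naive_packing_code seed could be as simple as a uniform grid (a hypothetical starting point; the Circle class is illustrative and matches only the .radius attribute the evaluator reads):

```python
import math
from dataclasses import dataclass

@dataclass
class Circle:
    x: float
    y: float
    radius: float

def pack_circles(n: int) -> list[Circle]:
    """Place n equal circles on a square-ish grid inside the unit square."""
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    r = 0.5 / max(cols, rows)  # equal radii: cells can neither overlap nor escape
    return [
        Circle(x=(2 * (i % cols) + 1) * r, y=(2 * (i // cols) + 1) * r, radius=r)
        for i in range(n)
    ]
```

Equal radii are the limiting factor: near-optimal packings use large circles in the interior and progressively smaller ones squeezed into corners and gaps, which is the adaptive sizing GEPA discovers.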

Visual Progression

  • Metric Call 0: Score 0.98 — sparse, inefficient packing
  • Metric Call 50: Score 2.61 — tightly packed with hexagonal patterns
  • Metric Call 89: Score 2.64 — near-optimal with adaptive sizing

Use Case 5: Blackbox Mathematical Optimization

Mode: Single-Task Search

Given a blackbox objective function, GEPA discovers an optimization algorithm tailored to it. Result: Matches Optuna on the 56-problem EvalSet benchmark.
def evaluate_optimizer(candidate: str, example: dict) -> tuple[float, dict]:
    """Test custom optimizer on a blackbox function."""
    exec_globals = {}
    exec(candidate, exec_globals)
    
    optimizer_fn = exec_globals["optimize"]
    
    # Run optimizer on the blackbox function
    best_value, history = optimizer_fn(
        objective=example["objective_fn"],
        bounds=example["bounds"],
        budget=example["budget"],
    )
    
    # Score is the negative distance from the known optimum (higher is better)
    score = -abs(best_value - example["optimum"])
    
    return score, {
        "BestValue": f"{best_value:.6f}",
        "Optimum": f"{example['optimum']:.6f}",
        "Gap": f"{abs(best_value - example['optimum']):.6f}",
        "ConvergencePlot": plot_history(history),
    }

evalset_problems = load_evalset_benchmark()

result = oa.optimize_anything(
    seed_candidate=simple_random_search,
    evaluator=evaluate_optimizer,
    dataset=evalset_problems,
    objective="Discover optimization algorithm for blackbox functions.",
)

Discovered Strategies

  • Boundary optima: Discovers L-BFGS-B (box-constrained optimizer)
  • Deceptive traps: Designs multi-start search from diverse starting points
  • Smooth landscapes: Adapts CMA-ES-like covariance matrix adaptation
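As a toy illustration of the multi-start pattern, an optimize function matching the (objective, bounds, budget) contract assumed by the evaluator might look like this (a hedged 1-D sketch, not a candidate GEPA actually produced):

```python
import random

def optimize(objective, bounds, budget):
    """Multi-start local search with annealed Gaussian steps (toy sketch)."""
    lo, hi = bounds
    restarts = 5
    steps = budget // restarts
    best_val, history = float("inf"), []
    for _ in range(restarts):
        x = random.uniform(lo, hi)  # diverse starting point per restart
        fx = objective(x)
        scale = (hi - lo) / 4
        for _ in range(steps):
            cand = min(hi, max(lo, x + random.gauss(0, scale)))
            fc = objective(cand)
            if fc < fx:          # greedy: keep only improvements
                x, fx = cand, fc
            scale *= 0.95        # shrink step size over time
            history.append(fx)
        best_val = min(best_val, fx)
    return best_val, history
```

The restarts guard against deceptive local traps, while the annealed step size gives late iterations the fine resolution a smooth landscape rewards, two of the discovered strategies listed above rolled into one sketch.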

Actionable Side Information (ASI)

The key to effective code optimization is rich diagnostic feedback:
from gepa import Image

def evaluate_with_rich_asi(candidate: str) -> tuple[float, dict]:
    result = execute_code(candidate)
    
    return result.score, {
        # Text diagnostics
        "CompilerOutput": result.compiler_output,
        "TestResults": result.test_output,
        
        # Metrics
        "Runtime": result.runtime_ms,
        "Memory": result.peak_memory_mb,
        
        # Multi-objective scores
        "scores": {
            "performance": result.speedup,
            "memory_efficiency": result.memory_score,
            "code_quality": result.maintainability_score,
        },
        
        # Visual feedback (for VLMs)
        "ProfilerTrace": Image(base64_data=render_flamegraph(result.profile)),
    }
ASI is the text-optimization analogue of the gradient. Where gradients tell a numerical optimizer which direction to move, ASI tells an LLM proposer why a candidate failed and how to fix it.

Optimization Modes

Code optimization supports all three modes:

1. Single-Task Search

Solve one hard problem:
oa.optimize_anything(
    seed_candidate=...,
    evaluator=...  # No example argument
)
Examples: Circle packing, blackbox optimization

2. Multi-Task Search

Solve a batch of related problems with cross-transfer:
oa.optimize_anything(
    seed_candidate=...,
    evaluator=...,  # Takes example argument
    dataset=tasks
)
Examples: CUDA kernels, multiple optimization problems

3. Generalization

Build a skill that transfers to unseen problems:
oa.optimize_anything(
    seed_candidate=..., 
    evaluator=...,
    dataset=train, 
    valset=val
)
Examples: Cloud scheduling, coding agent skills

Best Practices

  • Include compiler errors, test results, profiler output, and runtime metrics in ASI.
  • Return a scores dict in ASI for performance, memory, code quality, etc. GEPA’s Pareto frontier preserves trade-offs.
  • Code execution can hang; use timeouts in your evaluator.
  • Return a score of 0 (or negative) when code violates hard constraints.
  • Don’t know how to write the initial code? Set seed_candidate=None and provide a detailed background.
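For the timeout advice in particular, running candidates in a separate process makes the limit enforceable even against infinite loops. A minimal sketch (the helper name and 10-second default are illustrative):

```python
import subprocess
import sys
import tempfile

def execute_with_timeout(candidate: str, timeout_s: float = 10.0):
    """Run candidate code in its own process so a hang can be killed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        # Surface the hang as feedback the evaluator can score as 0.
        return None, "", f"timed out after {timeout_s}s"
```

An evaluator can then return 0.0 with {"Error": stderr} whenever the return code is None or nonzero, turning hangs into ordinary ASI instead of stalled optimization runs.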

Comparison with Other Frameworks

| Feature | GEPA | AlphaEvolve | OpenEvolve | ShinkaEvolve |
| --- | --- | --- | --- | --- |
| Multi-task search | ✓ | ✗ | ✗ | ✗ |
| Generalization mode | ✓ | ✗ | ✗ | ✗ |
| Structured ASI | ✓ | Limited | Limited | Limited |
| Visual feedback | ✓ | ✗ | ✗ | ✗ |
| Pareto selection | ✓ | ✗ | ✗ | ✗ |
| Seedless mode | ✓ | ✗ | ✗ | ✗ |

Next Steps

Agent Architecture

Discover optimal agent designs

Prompt Optimization

Optimize LLM prompts with GEPA

API Reference

Complete optimize_anything API docs

Examples

Browse code examples on GitHub
