GEPA’s optimize_anything API extends beyond prompts to optimize any text artifact, including code. From CUDA kernels to cloud scheduling policies, GEPA can evolve code by reading execution results, compiler errors, and profiling data.

Key Results

CUDA Kernels

87% of kernels match or beat baseline, 25% are 20%+ faster

Cloud Scheduling

40.2% cost savings on multi-cloud data transfer (CloudCast)

Coding Agents

55% → 82% resolve rate with auto-learned skills

Circle Packing

Outperforms AlphaEvolve’s solution with a score of 2.636+

The optimize_anything API

Basic Pattern

At its core, optimizing code requires just an artifact and an evaluator:
import gepa.optimize_anything as oa

def evaluate(candidate: str) -> float:
    result = execute_code(candidate)
    oa.log(f"Error: {result.stderr}")
    oa.log(f"Runtime: {result.time_ms}ms")
    return result.score

result = oa.optimize_anything(
    seed_candidate="<your initial code>",
    evaluator=evaluate,
)

print(result.best_candidate)

With Structured Feedback

For richer diagnostics, return a score together with a feedback dictionary:
def evaluate(candidate: str) -> tuple[float, dict]:
    result = execute_code(candidate)
    return result.score, {
        "Error": result.stderr,
        "Output": result.stdout,
        "Runtime": f"{result.time_ms:.1f}ms",
        "Memory": f"{result.peak_memory_mb:.1f}MB",
    }

Use Case 1: CUDA Kernel Generation

Mode: Multi-Task Search

Generate fast CUDA kernels for multiple PyTorch operations from KernelBench.

Results

  • 87% of generated kernels match or beat baseline performance
  • 48% are 10%+ faster than baseline
  • 25% are 20%+ faster than baseline
  • Multi-task mode outperforms single-task optimization through cross-transfer learning
import gepa.optimize_anything as oa

def evaluate_kernel(candidate: str, example: dict) -> tuple[float, dict]:
    """Compile and benchmark CUDA kernel."""
    compile_result = compile_cuda(candidate, example["reference_code"])
    
    if not compile_result.success:
        return 0.0, {
            "CompilationError": compile_result.stderr,
            "LineNumber": compile_result.error_line,
        }
    
    benchmark_result = benchmark_kernel(compile_result.binary, example["input"])
    
    speedup = benchmark_result.baseline_time / benchmark_result.kernel_time
    
    return speedup, {
        "BaselineTime": f"{benchmark_result.baseline_time:.2f}ms",
        "KernelTime": f"{benchmark_result.kernel_time:.2f}ms",
        "Speedup": f"{speedup:.2f}x",
        "ProfilerOutput": benchmark_result.profiler_trace,
    }

# Load KernelBench tasks
kernelbench_tasks = load_kernelbench_problems()

result = oa.optimize_anything(
    seed_candidate=cuda_template,
    evaluator=evaluate_kernel,
    dataset=kernelbench_tasks,  # Multi-task mode
    objective="Generate optimized CUDA kernels for PyTorch operations.",
    config=oa.GEPAConfig(
        engine=oa.EngineConfig(max_metric_calls=500),
        reflection=oa.ReflectionConfig(reflection_lm="openai/gpt-5"),
    ),
)

Multi-Task vs Single-Task

On the 10 problems where multi-task mode performed best, we re-optimized each from scratch in single-task mode:
Multi-task mode consistently outperforms dedicated single-task optimization across all speedup thresholds (1.0x, 1.1x, 1.2x).
Why it works: Optimization patterns discovered for one kernel (e.g., memory coalescing, warp-level operations) transfer to other kernels. The Pareto frontier preserves specialized strategies that excel on different operations.
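This per-task selection can be sketched in a few lines (a simplified illustration of the idea, not GEPA's internal implementation): a candidate survives as long as it is the best on at least one task.

```python
def pareto_frontier(per_task_scores: dict[str, list[float]]) -> set[str]:
    """Keep every candidate that achieves the best score on at least one task."""
    n_tasks = len(next(iter(per_task_scores.values())))
    frontier = set()
    for task in range(n_tasks):
        best = max(scores[task] for scores in per_task_scores.values())
        frontier.update(
            name for name, scores in per_task_scores.items() if scores[task] == best
        )
    return frontier

# Each specialist wins one kernel, so both survive; the generalist,
# best at nothing, is dropped. (Candidate names here are made up.)
candidates = {
    "coalesced_loads": [1.4, 0.9],  # speedups on kernels A and B
    "warp_shuffle":    [0.8, 1.5],
    "generalist":      [1.1, 1.1],
}
print(pareto_frontier(candidates))
```

Keeping per-task winners rather than a single global best is what lets a memory-coalescing trick survive long enough to be recombined into kernels where it was not initially the top strategy.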

Use Case 2: Cloud Scheduling Policies

Mode: Generalization

CloudCast: Multi-Cloud Data Transfer

Discover broadcast routing strategies for multi-cloud data transfer that minimize egress cost. Result: 40.2% cost savings on test set, outperforming expert heuristics and other LLM evolution frameworks.
def evaluate_cloudcast(candidate: str, example: dict) -> tuple[float, dict]:
    """Simulate data transfer and measure cost."""
    transfer_graph = build_transfer_graph(example)
    exec_globals = {}
    exec(candidate, exec_globals)
    
    routing_fn = exec_globals["route_transfer"]
    routes = routing_fn(transfer_graph, example["source"], example["destinations"])
    
    total_cost = simulate_transfer(routes, example["pricing"])
    baseline_cost = example["baseline_cost"]
    
    savings_pct = (baseline_cost - total_cost) / baseline_cost * 100
    
    return savings_pct, {
        "Cost": f"${total_cost:.2f}",
        "BaselineCost": f"${baseline_cost:.2f}",
        "Savings": f"{savings_pct:.1f}%",
        "RouteTopology": describe_topology(routes),
    }

train_scenarios, test_scenarios = load_cloudcast_dataset()

result = oa.optimize_anything(
    seed_candidate=baseline_dijkstra_routing,
    evaluator=evaluate_cloudcast,
    dataset=train_scenarios,
    valset=test_scenarios,
    objective="Minimize multi-cloud data transfer cost.",
    background="""
    The candidate is a Python function that takes a transfer graph and returns 
    a routing strategy. Consider:
    - Provider-specific egress pricing
    - Network topology constraints
    - Steiner tree vs shortest-path tradeoffs
    """,
)
Evolution: From baseline Dijkstra routing → provider-aware Steiner tree algorithm with Pareto-frontier candidate selection.
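For reference, a seed like baseline_dijkstra_routing might define the route_transfer contract the evaluator expects along these lines (the adjacency-dict graph format here is an assumption for illustration, not CloudCast's actual schema):

```python
import heapq

def route_transfer(graph, source, destinations):
    """Baseline: independent cheapest-path (Dijkstra) routing to each destination.
    Assumes graph is {node: {neighbor: egress_cost_per_gb}} (illustrative format)."""
    routes = {}
    for dest in destinations:
        dist = {source: 0.0}
        prev = {}
        heap = [(0.0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if node == dest:
                break  # cheapest path to this destination found
            if d > dist.get(node, float("inf")):
                continue  # stale heap entry
            for nbr, cost in graph.get(node, {}).items():
                nd = d + cost
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    prev[nbr] = node
                    heapq.heappush(heap, (nd, nbr))
        # Walk the predecessor map back from the destination.
        path, node = [dest], dest
        while node != source:
            node = prev[node]
            path.append(node)
        routes[dest] = path[::-1]
    return routes

# aws -> azure directly ($0.087/GB) beats routing through gcp ($0.12 + $0.08)
print(route_transfer(
    {"aws": {"azure": 0.087, "gcp": 0.12}, "gcp": {"azure": 0.08}},
    "aws", ["azure"],
))  # {'azure': ['aws', 'azure']}
```

Routing each destination independently is exactly the weakness GEPA exploits: a Steiner-tree strategy can share path prefixes across destinations and pay egress once.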

Can’t Be Late: Spot Instance Scheduling

Learn scheduling policies that decide when to use cheap SPOT instances vs reliable ON_DEMAND instances. Result: 7.8% cost savings while meeting all deadlines.
def evaluate_scheduling(candidate: str, example: dict) -> tuple[float, dict]:
    """Simulate task scheduling and measure cost."""
    exec_globals = {}
    exec(candidate, exec_globals)
    
    schedule_fn = exec_globals["schedule_tasks"]
    schedule = schedule_fn(example["tasks"], example["spot_availability"])
    
    result = simulate_execution(schedule, example)
    
    if result.deadline_missed:
        return 0.0, {"Error": "Deadline missed"}
    
    savings_pct = (result.baseline_cost - result.cost) / result.baseline_cost * 100
    
    return savings_pct, {
        "Cost": f"${result.cost:.2f}",
        "SpotUsage": f"{result.spot_pct:.1f}%",
        "DeadlineMargin": f"{result.deadline_margin}s",
    }
Evolution: From simple deadline-check heuristic → adaptive strategy tracking spot availability patterns and break-even switching costs.
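The "simple deadline-check heuristic" starting point might look like this (a hypothetical sketch; the task fields, the availability threshold, and the 3x spot slowdown factor are illustrative assumptions):

```python
def schedule_tasks(tasks, spot_availability):
    """Deadline-check heuristic: use SPOT only when even the worst case
    still makes the deadline. Field names here are assumptions."""
    SPOT_SLOWDOWN = 3.0  # pessimistic allowance for preemptions and restarts
    schedule = []
    for task in tasks:
        worst_case_s = task["duration_s"] * SPOT_SLOWDOWN
        use_spot = worst_case_s < task["deadline_s"] and spot_availability > 0.5
        schedule.append({
            "id": task["id"],
            "instance": "SPOT" if use_spot else "ON_DEMAND",
        })
    return schedule
```

A fixed slowdown factor leaves savings on the table; the evolved policy replaces it with observed availability patterns and a break-even calculation for switching instance types mid-task.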

Use Case 3: Coding Agent Skills

Mode: Generalization

Optimize natural-language instructions (skills) for coding agents working with specific codebases.

Results

  • Bleve (Go search library): 24% → 93% pass rate
  • Jinja (Python template engine): 55% → 82% resolve rate
  • Transfer to Claude Code: Near-perfect accuracy with 47% faster resolution
import gepa.optimize_anything as oa

def evaluate_coding_skill(candidate: dict, example: dict) -> tuple[float, dict]:
    """Run coding agent with skill on a repository task."""
    agent = create_coding_agent(
        model="anthropic/claude-sonnet-4.5",
        skills=candidate["skills"]
    )
    
    result = agent.solve_task(
        repo=example["repo_path"],
        task=example["task_description"],
        timeout=300
    )
    
    tests_passed = run_tests(example["repo_path"])
    
    return float(tests_passed), {
        "TestsPassed": tests_passed,
        "AgentLog": result.execution_log,
        "Error": result.error if not tests_passed else None,
        "TimeSpent": f"{result.duration}s",
    }

# Load repository-specific tasks
bleve_train, bleve_test = load_bleve_tasks()

result = oa.optimize_anything(
    seed_candidate={"skills": "Follow best practices."},
    evaluator=evaluate_coding_skill,
    dataset=bleve_train,
    valset=bleve_test,
    objective="Learn repository-specific skills for the Bleve Go codebase.",
    background="""
    Bleve is a Go full-text search library. Common patterns:
    - Heavy use of interfaces
    - Custom index structures
    - Memory-mapped files
    """,
)

Learned Skills Example

GEPA discovers specific guidance like:
  • “When modifying index structures, always update both the in-memory and disk representations”
  • “Bleve uses custom serialization — check existing encode/decode methods before adding new ones”
  • “Tests require specific fixture files in testdata/ — ensure paths are relative to package root”
Read the full blog post →

Use Case 4: Circle Packing

Mode: Single-Task Search

Pack n=26 circles to maximize the sum of their radii within a unit square. Result: Score of 2.636+, outperforming AlphaEvolve, ShinkaEvolve, and OpenEvolve.
def evaluate_circle_packing(candidate: str) -> tuple[float, dict]:
    """Execute packing algorithm and return sum of radii."""
    exec_globals = {"n": 26}
    exec(candidate, exec_globals)
    
    packing_fn = exec_globals["pack_circles"]
    circles = packing_fn(n=26)
    
    # Validate constraints
    violations = check_constraints(circles)
    if violations:
        return 0.0, {"Error": violations}
    
    total_radius = sum(c.radius for c in circles)
    
    return total_radius, {
        "TotalRadius": f"{total_radius:.5f}",
        "NumCircles": len(circles),
        "Packing": visualize_packing(circles),  # ASCII art or image
        "MinSpacing": f"{compute_min_spacing(circles):.5f}",
    }

result = oa.optimize_anything(
    seed_candidate=naive_packing_code,
    evaluator=evaluate_circle_packing,
    objective="Maximize sum of circle radii while fitting in unit square.",
)
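A naive_packing_code seed could be as simple as a uniform grid (a hypothetical starting point; the Circle class is illustrative and matches only the .radius attribute the evaluator reads):

```python
import math
from dataclasses import dataclass

@dataclass
class Circle:
    x: float
    y: float
    radius: float

def pack_circles(n: int) -> list[Circle]:
    """Place n equal circles on a square-ish grid inside the unit square."""
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    r = 0.5 / max(cols, rows)  # equal radii: cells can neither overlap nor escape
    return [
        Circle(x=(2 * (i % cols) + 1) * r, y=(2 * (i // cols) + 1) * r, radius=r)
        for i in range(n)
    ]
```

Equal radii are the limiting factor: near-optimal packings use large circles in the interior and progressively smaller ones squeezed into corners and gaps, which is the adaptive sizing GEPA discovers.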

Visual Progression

  • Metric Call 0: Score 0.98 — sparse, inefficient packing
  • Metric Call 50: Score 2.61 — tightly packed with hexagonal patterns
  • Metric Call 89: Score 2.64 — near-optimal with adaptive sizing

Use Case 5: Blackbox Mathematical Optimization

Mode: Single-Task Search

Given a blackbox objective function, GEPA discovers an optimization algorithm tailored to it. Result: Matches Optuna on the 56-problem EvalSet benchmark.
def evaluate_optimizer(candidate: str, example: dict) -> tuple[float, dict]:
    """Test custom optimizer on a blackbox function."""
    exec_globals = {}
    exec(candidate, exec_globals)
    
    optimizer_fn = exec_globals["optimize"]
    
    # Run optimizer on the blackbox function
    best_value, history = optimizer_fn(
        objective=example["objective_fn"],
        bounds=example["bounds"],
        budget=example["budget"],
    )
    
    # Score is the negative distance from the known optimum (higher is better)
    score = -abs(best_value - example["optimum"])
    
    return score, {
        "BestValue": f"{best_value:.6f}",
        "Optimum": f"{example['optimum']:.6f}",
        "Gap": f"{abs(best_value - example['optimum']):.6f}",
        "ConvergencePlot": plot_history(history),
    }

evalset_problems = load_evalset_benchmark()

result = oa.optimize_anything(
    seed_candidate=simple_random_search,
    evaluator=evaluate_optimizer,
    dataset=evalset_problems,
    objective="Discover optimization algorithm for blackbox functions.",
)

Discovered Strategies

  • Boundary optima: Discovers L-BFGS-B (box-constrained optimizer)
  • Deceptive traps: Designs multi-start search from diverse starting points
  • Smooth landscapes: Adapts CMA-ES-like covariance matrix adaptation
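As a toy illustration of the multi-start pattern, an optimize function matching the (objective, bounds, budget) contract assumed by the evaluator might look like this (a hedged 1-D sketch, not a candidate GEPA actually produced):

```python
import random

def optimize(objective, bounds, budget):
    """Multi-start local search with annealed Gaussian steps (toy sketch)."""
    lo, hi = bounds
    restarts = 5
    steps = budget // restarts
    best_val, history = float("inf"), []
    for _ in range(restarts):
        x = random.uniform(lo, hi)  # diverse starting point per restart
        fx = objective(x)
        scale = (hi - lo) / 4
        for _ in range(steps):
            cand = min(hi, max(lo, x + random.gauss(0, scale)))
            fc = objective(cand)
            if fc < fx:          # greedy: keep only improvements
                x, fx = cand, fc
            scale *= 0.95        # shrink step size over time
            history.append(fx)
        best_val = min(best_val, fx)
    return best_val, history
```

The restarts guard against deceptive local traps, while the annealed step size gives late iterations the fine resolution a smooth landscape rewards, two of the discovered strategies listed above rolled into one sketch.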

Actionable Side Information (ASI)

The key to effective code optimization is rich diagnostic feedback:
from gepa import Image

def evaluate_with_rich_asi(candidate: str) -> tuple[float, dict]:
    result = execute_code(candidate)
    
    return result.score, {
        # Text diagnostics
        "CompilerOutput": result.compiler_output,
        "TestResults": result.test_output,
        
        # Metrics
        "Runtime": result.runtime_ms,
        "Memory": result.peak_memory_mb,
        
        # Multi-objective scores
        "scores": {
            "performance": result.speedup,
            "memory_efficiency": result.memory_score,
            "code_quality": result.maintainability_score,
        },
        
        # Visual feedback (for VLMs)
        "ProfilerTrace": Image(base64_data=render_flamegraph(result.profile)),
    }
ASI is the text-optimization analogue of the gradient. Where gradients tell a numerical optimizer which direction to move, ASI tells an LLM proposer why a candidate failed and how to fix it.

Optimization Modes

Code optimization supports all three modes:

1. Single-Task Search

Solve one hard problem:
oa.optimize_anything(
    seed_candidate=...,
    evaluator=...  # No example argument
)
Examples: Circle packing, blackbox optimization

2. Multi-Task Search

Solve a batch of related problems with cross-transfer:
oa.optimize_anything(
    seed_candidate=...,
    evaluator=...,  # Takes example argument
    dataset=tasks
)
Examples: CUDA kernels, multiple optimization problems

3. Generalization

Build a skill that transfers to unseen problems:
oa.optimize_anything(
    seed_candidate=..., 
    evaluator=...,
    dataset=train, 
    valset=val
)
Examples: Cloud scheduling, coding agent skills

Best Practices

  • Include compiler errors, test results, profiler output, and runtime metrics in ASI.
  • Return a scores dict in ASI for performance, memory, code quality, etc. GEPA’s Pareto frontier preserves trade-offs.
  • Code execution can hang; use timeouts in your evaluator.
  • Return a score of 0 (or negative) when code violates hard constraints.
  • Don’t know how to write the initial code? Set seed_candidate=None and provide a detailed background.
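For the timeout advice in particular, running candidates in a separate process makes the limit enforceable even against infinite loops. A minimal sketch (the helper name and 10-second default are illustrative):

```python
import subprocess
import sys
import tempfile

def execute_with_timeout(candidate: str, timeout_s: float = 10.0):
    """Run candidate code in its own process so a hang can be killed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        # Surface the hang as feedback the evaluator can score as 0.
        return None, "", f"timed out after {timeout_s}s"
```

An evaluator can then return 0.0 with {"Error": stderr} whenever the return code is None or nonzero, turning hangs into ordinary ASI instead of stalled optimization runs.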

Comparison with Other Frameworks

| Feature | GEPA | AlphaEvolve | OpenEvolve | ShinkaEvolve |
| --- | --- | --- | --- | --- |
| Multi-task search | ✓ | ✗ | ✗ | ✗ |
| Generalization mode | ✓ | ✗ | ✗ | ✗ |
| Structured ASI | ✓ | Limited | Limited | Limited |
| Visual feedback | ✓ | ✗ | ✗ | ✗ |
| Pareto selection | ✓ | ✗ | ✗ | ✗ |
| Seedless mode | ✓ | ✗ | ✗ | ✗ |

Next Steps

Agent Architecture

Discover optimal agent designs

Prompt Optimization

Optimize LLM prompts with GEPA

API Reference

Complete optimize_anything API docs

Examples

Browse code examples on GitHub
