Prompt Optimization Loop

Iteratively improve prompts through testing, evaluation, and human feedback. Multiple levels of nesting: run → evaluate → human feedback → improve.

The Pattern

Outer loop (human feedback):
└── Inner loop (variant testing):
    └── Pipeline under test (the prompt being optimized)
This demonstrates hypergraph’s natural hierarchy — cycles inside DAGs, DAGs inside cycles, at multiple levels.

Implementation

Step 1: Define the pipeline being optimized

Start with the system you want to improve:
from hypergraph import Graph, node, route, END, AsyncRunner
from anthropic import Anthropic
import json

client = Anthropic()

@node(output_name="response")
def generate(query: str, system_prompt: str) -> str:
    """Pipeline under test: answer *query* using the candidate system prompt."""

    completion = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": query}],
    )

    # First content block is the text response.
    return completion.content[0].text


pipeline = Graph([generate], name="pipeline")
Step 2: Generate prompt variants

Use an LLM to create improved versions based on feedback:
@node(output_name="variants")
def generate_variants(
    base_prompt: str,
    feedback: str = "",
    num_variants: int = 3,
) -> list[str]:
    """
    Generate prompt variants based on feedback.

    Uses Claude Opus 4.5 for high-quality prompt engineering. The base
    prompt is always included so each round compares against the baseline.

    Args:
        base_prompt: The system prompt to vary.
        feedback: Optional feedback from a previous round to incorporate.
        num_variants: How many new variants to request.

    Returns:
        [base_prompt] + the generated variants (strings only).

    Raises:
        json.JSONDecodeError: If the model reply is not valid JSON even
            after stripping Markdown code fences.
    """

    instruction = f"""Generate {num_variants} variations of this system prompt.
Each variation should be meaningfully different while preserving the core intent.

Base prompt:
{base_prompt}
"""

    if feedback:
        instruction += f"""
Previous feedback to incorporate:
{feedback}
"""

    instruction += """
Return a JSON array of strings, each being a complete system prompt.
No explanation, just the JSON array."""

    message = client.messages.create(
        model="claude-opus-4-5-20251101",
        max_tokens=2048,
        messages=[{"role": "user", "content": instruction}],
    )

    raw = message.content[0].text.strip()
    # Models sometimes wrap JSON in Markdown fences despite the instruction;
    # strip a leading ```/```json and a trailing ``` before parsing.
    if raw.startswith("```"):
        raw = raw.removeprefix("```json").removeprefix("```").removesuffix("```").strip()

    parsed = json.loads(raw)
    # Guard against malformed payloads (nested arrays, objects, numbers).
    variants = [v for v in parsed if isinstance(v, str)]
    return [base_prompt] + variants  # Include original for comparison
Tip: Use a more powerful model (Opus) for prompt generation than the model being optimized (Sonnet).

Step 3: Test variants

Run each variant against test cases and score the results:
@node(output_name="test_results")
async def test_variants(
    variants: list[str],
    test_cases: list[dict],
) -> list[dict]:
    """
    Test each variant against the test cases.

    Args:
        variants: Candidate system prompts (index 0 is the baseline).
        test_cases: Dicts with a "query" plus optional "expected_keywords"
            and "criteria" entries.

    Returns:
        One result dict per variant, sorted best-first by "avg_score".
    """
    runner = AsyncRunner()
    results = []

    for i, variant in enumerate(variants):
        scores = []

        for test in test_cases:
            # Run the pipeline with this variant
            result = await runner.run(pipeline, {
                "query": test["query"],
                "system_prompt": variant,
            })

            # Score the response.  NOTE: the parameter is named
            # `expected_keywords` — passing `expected=` raised a TypeError.
            score = evaluate_response(
                response=result["response"],
                expected_keywords=test.get("expected_keywords", []),
                criteria=test.get("criteria", {}),
            )
            scores.append(score)

        results.append({
            "variant_index": i,
            "prompt": variant[:100] + "..." if len(variant) > 100 else variant,
            "full_prompt": variant,
            # Guard against an empty test suite (avoids ZeroDivisionError).
            "avg_score": sum(scores) / len(scores) if scores else 0.0,
            "scores": scores,
        })

    return sorted(results, key=lambda x: x["avg_score"], reverse=True)


def evaluate_response(response: str, expected_keywords: list, criteria: dict) -> float:
    """Score a response on a 0-1 scale.

    Keyword coverage contributes up to half the score; the optional
    "min_length" and "must_include" criteria contribute a quarter each.
    """
    total = 0.0

    # Keyword coverage: fraction of expected keywords found (case-insensitive).
    if expected_keywords:
        lowered = response.lower()
        hits = [kw for kw in expected_keywords if kw.lower() in lowered]
        total += 0.5 * len(hits) / len(expected_keywords)

    # Length criterion.
    if "min_length" in criteria and len(response) >= criteria["min_length"]:
        total += 0.25

    # Format criterion: every required substring must appear verbatim.
    if "must_include" in criteria and all(s in response for s in criteria["must_include"]):
        total += 0.25

    return min(total, 1.0)
Step 4: Select the best variant
@node(output_name="best_variant")
def select_best(test_results: list[dict]) -> dict:
    """Return the top-scoring variant (test_results arrives sorted descending)."""
    best, *_ = test_results
    return best
Step 5: Create the optimization loop

Automated iteration until target score or max iterations:
@node(output_name="iteration")
def track_iteration(iteration: int = 0) -> int:
    """Increment and return the loop's iteration counter."""
    next_iteration = iteration + 1
    return next_iteration


@route(targets=["generate_variants", END])
def optimization_gate(
    best_variant: dict,
    iteration: int,
    target_score: float = 0.9,
    max_iterations: int = 5,
) -> str:
    """Decide if optimization should continue."""
    score = best_variant["avg_score"]

    # Stop as soon as either termination condition is met.
    if score >= target_score:
        print(f"✓ Target score reached: {score:.2f}")
        return END
    if iteration >= max_iterations:
        print(f"✓ Max iterations reached. Best score: {score:.2f}")
        return END

    # Otherwise cycle back to variant generation.
    print(f"→ Iteration {iteration}: score={score:.2f}, continuing...")
    return "generate_variants"


# Inner (automated) loop: generate -> test -> select -> gate.  The gate
# routes back to generate_variants (a cycle) or to END when done.
optimization_loop = Graph([
    generate_variants,
    test_variants,
    select_best,
    track_iteration,
    optimization_gate,
], name="optimization")
This inner loop runs automatically until it reaches the target score or maximum iterations.

Step 6: Add human-in-the-loop

Wrap the optimization loop in a human feedback cycle:
@node(output_name="feedback")
def get_human_feedback(best_variant: dict, test_results: list[dict]) -> str:
    """
    Display results and get human feedback.
    In production, this might be a web UI or API call.
    """
    rule = "=" * 60
    print("\n" + rule)
    print("OPTIMIZATION RESULTS")
    print(rule)

    for rank, entry in enumerate(test_results[:3], start=1):  # Top 3
        print(f"\n#{rank} (score: {entry['avg_score']:.2f})")
        print(f"   {entry['prompt']}")

    divider = "-" * 60
    print("\n" + divider)
    print(f"Best prompt (score: {best_variant['avg_score']:.2f}):")
    print(best_variant["full_prompt"])
    print(divider)

    return input("\nFeedback (or 'done' to finish): ").strip()


@route(targets=["optimization", END])
def human_gate(feedback: str) -> str:
    """Check if human wants to continue."""
    stop_words = ("done", "quit", "exit", "")
    return END if feedback.lower() in stop_words else "optimization"


# Outer loop: the entire inner optimization cycle becomes one node inside
# the human feedback cycle — two nested levels of cycles.
human_loop = Graph([
    optimization_loop.as_node(),  # Inner loop as a node
    get_human_feedback,
    human_gate,
], name="human_optimization")
The optimization loop (cyclic) is nested inside the human feedback loop (also cyclic). Two levels of cycles!

Step 7: Run the full system
async def main():
    """Drive the full nested optimization system end to end."""
    runner = AsyncRunner()

    # Test cases for evaluation
    test_cases = [
        {
            "query": "Explain quantum computing to a beginner",
            "expected_keywords": ["qubit", "superposition", "classical"],
            "criteria": {"min_length": 200},
        },
        {
            "query": "What is machine learning?",
            "expected_keywords": ["data", "algorithm", "pattern"],
            "criteria": {"min_length": 150},
        },
        {
            "query": "How does encryption work?",
            "expected_keywords": ["key", "secure", "decrypt"],
            "criteria": {"min_length": 150},
        },
    ]

    inputs = {
        "base_prompt": "You are a helpful assistant that explains technical concepts.",
        "test_cases": test_cases,
        "target_score": 0.85,
        "max_iterations": 3,
    }
    outcome = await runner.run(human_loop, inputs)

    banner = "=" * 60
    print("\n" + banner)
    print("FINAL OPTIMIZED PROMPT:")
    print(banner)
    best = outcome["best_variant"]
    print(best["full_prompt"])
    print(f"\nFinal score: {best['avg_score']:.2f}")


import asyncio

# Guard the entry point so importing this module does not kick off an
# interactive optimization run.
if __name__ == "__main__":
    asyncio.run(main())

Key Patterns

Multiple Nesting Levels

human_loop (cyclic)
└── optimization_loop.as_node() (cyclic)
    └── pipeline (DAG)
Three levels of composition: DAG inside cycle inside another cycle.

Automated Testing

Each prompt variant is tested against a suite of test cases with scoring criteria.

Human-in-the-Loop

The outer loop pauses for human feedback, allowing guided optimization.

Early Termination

Both loops can terminate early when goals are reached:
  • Inner loop: Target score or max iterations
  • Outer loop: Human says “done”

Variations

A/B Testing

Compare two prompts directly:
@node(output_name="winner")
async def ab_test(prompt_a: str, prompt_b: str, test_cases: list) -> dict:
    """A/B test two prompts.

    Runs both prompts through the same test suite via ``test_variants.func``
    (which creates its own runner — the old local ``runner`` was unused) and
    reports the winner plus both average scores.
    """
    results_a = await test_variants.func([prompt_a], test_cases)
    results_b = await test_variants.func([prompt_b], test_cases)

    score_a = results_a[0]["avg_score"]
    score_b = results_b[0]["avg_score"]
    return {
        # Ties go to B, matching the original strict comparison.
        "winner": "A" if score_a > score_b else "B",
        "score_a": score_a,
        "score_b": score_b,
    }

LLM-as-Judge

Use an LLM to evaluate responses:
@node(output_name="score")
def llm_evaluate(response: str, query: str, criteria: str) -> float:
    """Use Claude to evaluate response quality.

    Returns a score clamped to [0, 1].

    Raises:
        ValueError: If the model reply contains no numeric token at all.
    """
    import re  # local import: keeps this optional snippet self-contained

    message = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Rate this response from 0 to 1.

Query: {query}
Response: {response}
Criteria: {criteria}

Return only a number between 0 and 1.""",
        }],
    )

    text = message.content[0].text.strip()
    # Models occasionally add stray wording ("Score: 0.8"); extract the
    # first numeric token instead of failing on float(text).
    match = re.search(r"\d*\.?\d+", text)
    if match is None:
        raise ValueError(f"No numeric score in model reply: {text!r}")
    # Clamp in case the model wanders outside the requested range.
    return min(max(float(match.group()), 0.0), 1.0)

Automated Human Feedback

Simulate human feedback for testing:
@node(output_name="feedback")
def simulate_human_feedback(best_variant: dict, iteration: int) -> str:
    """Simulate human feedback based on score."""
    # Good enough: end the outer loop.
    if best_variant["avg_score"] >= 0.9:
        return "done"

    # Otherwise rotate through a fixed set of canned suggestions.
    rotation = (
        "Make it more concise",
        "Add more examples",
        "Use simpler language",
        "Be more technical",
    )
    return rotation[iteration % len(rotation)]

Testing

import pytest

def test_variant_generation():
    """Integration test: hits the live API, so requires credentials.

    ``generate_variants.func`` is a plain synchronous function, so the
    previous ``@pytest.mark.asyncio``/``async def`` wrapper awaited
    nothing — a plain test function is correct here.
    """
    variants = generate_variants.func(
        base_prompt="You are a helpful assistant.",
        feedback="Be more specific",
        num_variants=2,
    )

    assert len(variants) == 3  # base + 2 variants
    assert all(isinstance(v, str) for v in variants)

def test_evaluation():
    """evaluate_response is pure, so it can be exercised offline."""
    result = evaluate_response(
        response="Machine learning uses data to find patterns in algorithms.",
        expected_keywords=["data", "algorithm", "pattern"],
        criteria={"min_length": 20},
    )

    assert 0 <= result <= 1
    assert result > 0.5  # Should match all keywords

Production Patterns

Save optimization history

import json
from datetime import datetime

@node(output_name="saved_path")
def save_results(best_variant: dict, iteration: int) -> str:
    """Save optimization results to disk.

    Returns the path of the JSON file written to the current directory.
    """
    from datetime import timezone  # local: header only imports datetime

    # datetime.utcnow() is deprecated (3.12+) and returns a naive value;
    # use one aware UTC timestamp for both the payload and the filename so
    # they always agree (previously the filename used local time).
    now = datetime.now(timezone.utc)
    results = {
        "timestamp": now.isoformat(),
        "iteration": iteration,
        "prompt": best_variant["full_prompt"],
        "score": best_variant["avg_score"],
    }

    path = f"optimization_results_{now.strftime('%Y%m%d_%H%M%S')}.json"
    with open(path, "w") as f:
        json.dump(results, f, indent=2)

    return path

Add monitoring

from hypergraph.events import EventProcessor, NodeEndEvent

class OptimizationMonitor(EventProcessor):
    """Collects the best score from each optimization iteration.

    NOTE(review): assumes the runner calls ``process`` with node lifecycle
    events and that ``select_best`` emits a ``best_variant`` dict carrying
    an ``avg_score`` key — confirm against the hypergraph events API.
    """

    def __init__(self):
        # One entry per completed inner-loop iteration, in order.
        self.scores = []

    def process(self, event):
        # Only react when the select_best node finishes.
        if isinstance(event, NodeEndEvent) and event.node_name == "select_best":
            score = event.outputs["best_variant"]["avg_score"]
            self.scores.append(score)
            print(f"Iteration {len(self.scores)}: score={score:.3f}")

# Wire the monitor into the runner so every select_best completion is logged.
monitor = OptimizationMonitor()
runner = AsyncRunner(event_processor=monitor)

What’s Next?

Evaluation Harness

Test systems at scale with nested graphs

Multi-Turn RAG

Optimize conversational systems

Hierarchical Composition

Deep dive on nesting patterns

Events & Monitoring

Monitor graph execution

Build docs developers (and LLMs) love