The Gradient Analogy
In numerical optimization, gradients tell us which direction to move:

```python
# Neural network training
for x, y in dataloader:
    pred = model(x)
    loss = criterion(pred, y)

    # Gradient: direction of steepest descent
    loss.backward()  # Computes ∂loss/∂weights

    # Update: move weights in direction of negative gradient
    optimizer.step()  # weights -= lr * gradient
```
The gradient is actionable: it directly specifies how to improve the parameters.
But text parameters have no gradient. Prompts, code, and instructions are discrete, symbolic objects. You can’t compute ∂performance/∂prompt.
ASI is the text-optimization analogue of the gradient. Instead of numerical derivatives, ASI provides diagnostic feedback that tells an LLM:
Why a candidate failed
What went wrong
How to fix it
Traditional optimizers receive only scalar feedback:

```python
# Traditional optimizer view
candidate = "Some text parameter"
score = evaluate(candidate)  # Returns: 0.45

# We know the score is bad, but...
# - Why did it fail?
# - Which examples were wrong?
# - What errors occurred?
# - How should we modify the text?
```
GEPA evaluators return rich diagnostic information:

```python
# GEPA evaluator with ASI
import gepa.optimize_anything as oa

def evaluate(candidate, example):
    result = run_system(candidate, example)
    score = compute_score(result)

    # ASI: structured diagnostic feedback
    side_info = {
        "Input": example["question"],
        "Output": result.answer,
        "Expected": example["correct_answer"],
        "Reasoning": result.reasoning_trace,
        "Error": result.error_message if result.failed else None,
    }
    return score, side_info
```
The LLM reads this ASI and proposes targeted improvements:
LLM sees:

```text
Input: "What is 2+2?"
Output: "The answer is 5"
Expected: "4"
Error: "Basic arithmetic failure"
```

LLM proposes:

```text
Add instruction: "Double-check arithmetic calculations
using the following format: 2 + 2 = 4"
```
ASI as the “Why” Signal
Traditional optimizers know that something failed. GEPA knows why:

| Traditional | GEPA (with ASI) |
| --- | --- |
| score = 0.0 | "Compilation error: undefined variable 'tmp'" |
| score = 0.3 | "Output format wrong: expected JSON, got plain text" |
| score = 0.6 | "Correct on simple cases, fails on negative inputs" |
| score = 0.9 | "Nearly perfect; minor edge case: empty list handling" |
This transforms optimization from random search to guided refinement.
ASI Structure
ASI is represented as a dictionary with two categories of information:
1. Scores (Optional)
Multi-objective metrics for Pareto tracking:
```python
side_info = {
    "scores": {
        "accuracy": 0.85,
        "latency_inv": 12.5,  # Higher is better (1/latency)
        "cost_inv": 8.3,
    },
    # ... contextual fields below
}
```
All scores must follow the “higher is better” convention. Invert metrics like latency or error rate.
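To keep a mixed set of metrics on that convention, a small helper can do the inversion (illustrative only; `invert_metric` is not a GEPA API):

```python
# Hypothetical helper: map a lower-is-better metric (latency, cost,
# error rate) onto the higher-is-better scale that Pareto tracking expects.
def invert_metric(value: float, eps: float = 1e-9) -> float:
    return 1.0 / (value + eps)  # eps guards against division by zero

scores = {
    "accuracy": 0.85,                    # already higher-is-better
    "latency_inv": invert_metric(0.08),  # 80 ms latency -> 12.5
    "cost_inv": invert_metric(0.12),     # $0.12 per call -> ~8.3
}
```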
2. Contextual Fields
Diagnostic feedback that explains performance:
Common conventions:
```python
side_info = {
    # What went in
    "Input": example["question"],
    "Context": example["background_knowledge"],

    # What came out
    "Output": result.generated_text,
    "Expected": example["correct_answer"],

    # Why it failed/succeeded
    "Feedback": qualitative_assessment(result),
    "Error": traceback if exception else None,

    # Intermediate states
    "Reasoning": result.chain_of_thought,
    "Tool Calls": result.function_invocations,
    "Profiling": result.execution_metrics,
}
```
Best practices:
Be specific: “Expected 42, got 17” beats “Wrong answer”
Include errors prominently: Tracebacks, compiler messages, validation failures
Show intermediate steps: Reasoning chains, tool outputs, state transitions
Add context: What was the input? What should have happened?
Use consistent field names: Pick a convention and stick to it
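A small helper can enforce these practices mechanically. The sketch below (with a hypothetical `build_side_info`; the field names are one possible convention) assembles ASI the same way for every example:

```python
def build_side_info(example, result, error=None):
    """Hypothetical helper applying the best practices above:
    consistent field names, specific feedback, prominent errors."""
    info = {
        "Input": example["input"],        # add context: what went in
        "Output": result["output"],
        "Expected": example["expected"],
        # Be specific: state both values, not just "wrong answer"
        "Feedback": f"Expected {example['expected']!r}, got {result['output']!r}",
    }
    if error is not None:
        info["Error"] = repr(error)       # include errors prominently
    return info
```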
Providing ASI: Three Methods
Method 1: Return Tuple from Evaluator
Explicitly return (score, side_info):
```python
def evaluate(candidate, example):
    result = run_code(candidate["code"], example["input"])
    score = 1.0 if result.output == example["expected"] else 0.0

    side_info = {
        "Input": example["input"],
        "Output": result.output,
        "Expected": example["expected"],
        "Execution Time": result.time_ms,
        "Memory Used": result.memory_mb,
    }
    if result.error:
        side_info["Error"] = str(result.error)

    return score, side_info
```
Method 2: Use oa.log()
Log diagnostics imperatively during evaluation:
```python
import gepa.optimize_anything as oa

def evaluate(candidate, example):
    oa.log(f"Input: {example['question']}")

    result = run_system(candidate, example)
    oa.log(f"Output: {result.answer}")
    oa.log(f"Expected: {example['correct_answer']}")

    if result.error:
        oa.log(f"ERROR: {result.error}")

    score = compute_score(result, example)
    return score  # side_info automatically captured under "log" key
```
Captured output:
```python
side_info = {
    "log": """
    Input: What is 2+2?
    Output: The answer is 5
    Expected: 4
    ERROR: Basic arithmetic failure
    """
}
```
oa.log() is thread-safe and works from child threads when properly propagated via oa.get_log_context() / oa.set_log_context().
Method 3: Automatic stdio Capture
Capture all print() statements automatically:
```python
config = GEPAConfig(
    engine=EngineConfig(
        capture_stdio=True  # Enable automatic capture
    )
)

def evaluate(candidate, example):
    print(f"Testing on: {example['input']}")  # Captured to side_info["stdout"]
    result = run_system(candidate, example)
    print(f"Result: {result.output}")  # Also captured
    return compute_score(result)
```
Captured output:
```python
side_info = {
    "stdout": "Testing on: input_1\nResult: output_xyz\n",
    "stderr": ""  # Empty if no stderr
}
```
capture_stdio=True captures Python-level output only (sys.stdout/sys.stderr). C extensions or subprocesses that write directly to file descriptors require manual capture via oa.log().
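If your evaluator shells out, one workaround is to relay the child's output at the Python level. The sketch below uses `print()` for illustration; inside a real evaluator you could call `oa.log()` instead. `run_and_relay` is a hypothetical helper, not a GEPA API:

```python
import subprocess
import sys

def run_and_relay(cmd):
    """Run a subprocess and re-emit its output at the Python level,
    where capture_stdio (or oa.log) can see it."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.stdout:
        print(proc.stdout, end="")                   # Python-level write -> captured
    if proc.stderr:
        print(proc.stderr, end="", file=sys.stderr)  # likewise for stderr
    return proc.returncode

# Example: the child writes directly to fd 1; the relay makes it visible
exit_code = run_and_relay([sys.executable, "-c", "print('from child')"])
```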
When optimizing multiple parameters, provide targeted feedback for each:
```python
def evaluate(candidate, example):
    # candidate = {"system_prompt": "...", "few_shot_examples": "..."}
    result = run_system(candidate, example)
    score = compute_score(result)

    side_info = {
        # Top-level feedback (applies to all parameters)
        "Input": example["question"],
        "Output": result.answer,
        "Expected": example["correct_answer"],

        # System prompt-specific feedback
        "system_prompt_specific_info": {
            "scores": {"instruction_following": 0.7},
            "Analysis": "Prompt too vague about output format",
            "Suggestion": "Add explicit formatting instructions"
        },

        # Few-shot example-specific feedback
        "few_shot_examples_specific_info": {
            "scores": {"coverage": 0.4},
            "Analysis": "No examples with negative numbers",
            "Suggestion": "Add edge case examples"
        }
    }
    return score, side_info
```
During reflection on parameter X, GEPA merges:
Top-level fields (generic feedback)
X_specific_info fields (targeted feedback)
This gives the reflection LM both general context and parameter-specific diagnostics.
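The merge can be pictured with a short sketch (illustrative only; GEPA's actual merge logic may differ in detail):

```python
def asi_for_parameter(side_info, param):
    """Sketch of reflection-time merging: keep generic top-level fields,
    drop other parameters' feedback, overlay this parameter's feedback."""
    merged = {
        k: v for k, v in side_info.items()
        if not k.endswith("_specific_info")          # drop all targeted blocks
    }
    merged.update(side_info.get(f"{param}_specific_info", {}))
    return merged

side_info = {
    "Input": "What is 2+2?",
    "system_prompt_specific_info": {"Analysis": "Prompt too vague"},
    "few_shot_examples_specific_info": {"Analysis": "No negative examples"},
}
merged = asi_for_parameter(side_info, "system_prompt")
# merged keeps "Input" plus only the system-prompt analysis
```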
Visual ASI with Images
For visual tasks (rendering, charts, UI), include images in ASI:
```python
from io import BytesIO

import matplotlib.pyplot as plt
import PIL.Image

from gepa import Image

def evaluate(candidate, example):
    # Generate SVG or render output
    output_svg = render_svg(candidate["svg_code"])
    expected_svg = example["target_svg"]

    # Create comparison visualization
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.imshow(parse_svg(output_svg))
    ax1.set_title("Generated")
    ax2.imshow(parse_svg(expected_svg))
    ax2.set_title("Expected")

    # Convert to PIL Image
    buf = BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    comparison_image = PIL.Image.open(buf)

    score = compute_visual_similarity(output_svg, expected_svg)
    return score, {
        "Input": example["description"],
        "Comparison": Image(comparison_image),  # VLM will see this
        "Feedback": "Colors match, but proportions are off"
    }
```
Image ASI requires a vision-language model (VLM) as the reflection LM. Set reflection_lm="openai/gpt-4o" or similar.
ASI vs Reward Shaping
In RL, reward shaping adds heuristic signals to guide learning:
```python
# RL reward shaping
reward = goal_achieved * 10.0     # Sparse signal
reward -= distance_to_goal * 0.1  # Dense guidance (closer is better)
reward -= collisions * 5.0        # Penalty
```
Problems with reward shaping:
Hard to design (requires domain expertise)
Brittle (small changes break learning)
Can introduce unintended incentives
Not interpretable (why was this action taken?)
ASI advantages:
Natural language diagnostics (easy to write)
Flexible (add/remove fields freely)
Interpretable (human-readable explanations)
Direct guidance (“fix this specific error”)
What Makes Good ASI?
ASI quality directly impacts optimization speed and quality:
✅ Good ASI
```python
side_info = {
    "Input": "Translate 'Hello world' to French",
    "Output": "Salut monde",
    "Expected": "Bonjour le monde",
    "Feedback": """
    Translation is too informal ('Salut' instead of 'Bonjour').
    Missing article 'le' before 'monde'.
    Context suggests formal register is appropriate.
    """,
    "Tone Score": 0.3,  # Quantify formality
}
```
Why it’s good:
Specific error identification
Explanation of why it’s wrong
Context for the correct choice
Quantified metric for tracking
❌ Bad ASI
```python
side_info = {
    "Output": "Salut monde",
    "Feedback": "Wrong translation"
}
```
Why it’s bad:
No explanation of why it’s wrong
Missing input/expected for context
No guidance on how to fix it
LLM has to guess the root cause
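A cheap guard against this failure mode is to lint ASI dicts for missing context fields before a run (a hypothetical check, not part of GEPA; the required set is one reasonable choice):

```python
# Fields the reflection LM needs so it never has to guess the root cause
REQUIRED_FIELDS = {"Input", "Output", "Expected"}

def missing_asi_fields(side_info):
    """Return the context fields absent from an ASI dict, sorted for stable output."""
    return sorted(REQUIRED_FIELDS - side_info.keys())

bad = {"Output": "Salut monde", "Feedback": "Wrong translation"}
print(missing_asi_fields(bad))  # ['Expected', 'Input']
```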
🎯 Excellent ASI
```python
side_info = {
    "Input": {"text": "Hello world", "context": "Business email"},
    "Output": "Salut monde",
    "Expected": "Bonjour le monde",
    "Error Analysis": """
    1. Register mismatch: 'Salut' is informal, context requires formal
    2. Grammatical error: Missing definite article 'le'
    3. Vocabulary: 'monde' alone is correct but incomplete
    """,
    "Correction Strategy": """
    - Use 'Bonjour' for formal contexts (vs 'Salut' for casual)
    - Always include articles: 'le monde' not just 'monde'
    - Check context field for register requirements
    """,
    "scores": {
        "grammatical_correctness": 0.5,
        "tone_appropriateness": 0.2,
        "fluency": 0.8
    },
    "Similar Failures": ["Example 3", "Example 7"],  # Pattern
}
```
Why it’s excellent:
Structured error breakdown
Explicit correction strategy
Multi-objective metrics
Pattern identification across examples
Common ASI Patterns
Coding Tasks
```python
side_info = {
    "Input": test_case,
    "Generated Code": candidate["code"],
    "Execution Result": result.output,
    "Expected Output": test_case["expected"],
    "Compilation Errors": result.compile_errors,
    "Runtime Errors": result.runtime_errors,
    "Test Status": "FAILED",
    "Profiling": {
        "time_ms": result.time_ms,
        "memory_mb": result.memory_mb,
    },
    "Coverage": result.line_coverage,
}
```
Agent Tasks
```python
side_info = {
    "User Query": example["query"],
    "Agent Trajectory": [
        {"action": "search", "input": "Paris weather", "output": "..."},
        {"action": "answer", "input": "...", "output": result.answer},
    ],
    "Final Answer": result.answer,
    "Expected Answer": example["correct_answer"],
    "Tool Errors": result.tool_errors,
    "Reasoning Quality": human_eval_score(result),
    "Efficiency": f"Used {len(result.actions)} actions (optimal: 2)",
}
```
Math/Reasoning Tasks
```python
side_info = {
    "Problem": example["question"],
    "Chain of Thought": result.reasoning,
    "Final Answer": result.answer,
    "Correct Answer": example["solution"],
    "Verification": verify_solution(result),
    "Error Type": classify_error(result, example),
    "Difficulty": example["difficulty_level"],
    "scores": {
        "correctness": 1.0 if result.answer == example["solution"] else 0.0,
        "reasoning_quality": rate_reasoning(result.reasoning),
    }
}
```
ASI in Practice: Example Flow
Iteration 1:
```python
# Candidate fails on 2/3 examples in minibatch
side_info = [
    {
        "Input": "2+2",
        "Output": "The answer is 5",
        "Expected": "4",
        "Error": "Basic arithmetic"
    },
    {
        "Input": "10*10",
        "Output": "100",
        "Expected": "100",
        "Feedback": "Correct!"
    },
    {
        "Input": "-3 + 5",
        "Output": "The answer is -8",
        "Expected": "2",
        "Error": "Wrong sign handling"
    }
]

# LLM reflects:
# "Two failures, both arithmetic. First: basic error (2+2=5).
# Second: negative number handling incorrect. Success on simple
# multiplication. Propose: Add step-by-step arithmetic verification."

# New candidate adds:
# "Before answering, verify: if addition, add left-to-right.
# For negative numbers, note the sign explicitly."
```
Iteration 2:
# New candidate tested on same minibatch
side_info = [
{ "Input" : "2+2" , "Output" : "4" , "Expected" : "4" , "Feedback" : "Fixed!" },
{ "Input" : "10*10" , "Output" : "100" , "Expected" : "100" , "Feedback" : "Still correct" },
{ "Input" : "-3 + 5" , "Output" : "2" , "Expected" : "2" , "Feedback" : "Fixed!" },
]
# Minibatch score: 0/3 → 3/3. Accepted!
# Full validation evaluation determines final score.
Next Steps
How GEPA Works: see how ASI fits into the full optimization loop
Reflective Evolution: understand how LLMs use ASI to propose improvements
Building Adapters: learn how to capture rich ASI in your adapter
Examples: see real ASI examples from GEPA use cases