
Math Problem Optimization Tutorial

Learn how to use GEPA to optimize system prompts for mathematical reasoning tasks. In this tutorial, we’ll improve GPT-4.1 Mini’s performance on AIME (American Invitational Mathematics Examination) problems from 46.6% to 56.6% accuracy through prompt optimization alone.

Overview

This tutorial demonstrates:
  • Training on AIME 2022-2024 problems
  • Testing on AIME 2025 (held-out set)
  • Optimizing system prompts for complex mathematical reasoning
  • Achieving significant gains without model fine-tuning

Step 1: Install GEPA

First, install the GEPA package:
pip install gepa

Step 2: Load the Dataset

GEPA provides a built-in AIME dataset loader:
import gepa

# Load AIME datasets
trainset, valset, testset = gepa.examples.aime.init_dataset()

print(f"Training examples: {len(trainset)}")
print(f"Validation examples: {len(valset)}")
print(f"Test examples: {len(testset)}")
The dataset includes:
  • Training: AIME validation problems (AI-MO/aimo-validation-aime)
  • Test: AIME 2025 problems (MathArena/aime_2025)
  • Each example contains the problem, solution, and answer in the format ### <answer>
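Because answers follow the ### <answer> convention, a small helper can pull the final answer out of a model completion. This is an illustrative sketch, not part of GEPA's API; the function name is our own:

```python
import re

def extract_final_answer(text: str):
    """Return the text after the last '###' marker, or None if absent."""
    matches = re.findall(r"###\s*(.+)", text)
    return matches[-1].strip() if matches else None

print(extract_final_answer("The total is 204.\n### 204"))  # → 204
```

Taking the last match is deliberate: chain-of-thought outputs sometimes mention "###" mid-reasoning, and the final marker is the one that carries the answer.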

Step 3: Define the Seed Prompt

Start with a basic system prompt:
seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}
This simple prompt serves as our baseline. GEPA will evolve it into a detailed, strategy-rich prompt.
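A GEPA candidate is just a dictionary of named text components, so applying one to a problem is plain string assembly. The helper below is a hypothetical illustration (the function name is ours), showing how the system prompt maps onto a chat request:

```python
def build_messages(candidate: dict, problem: str) -> list:
    """Assemble a chat request from a candidate's system prompt and a problem."""
    return [
        {"role": "system", "content": candidate["system_prompt"]},
        {"role": "user", "content": problem},
    ]

seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}
messages = build_messages(seed_prompt, "What is 2 + 2?")
print(messages[0]["role"])  # → system
```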

Step 4: Run GEPA Optimization

Optimize the prompt using GEPA:
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

print("Optimized prompt:", result.best_candidate['system_prompt'])
print(f"Best validation score: {result.val_aggregate_scores[result.best_idx]:.2%}")
Key Parameters:
  • task_lm: The model being optimized (GPT-4.1 Mini)
  • reflection_lm: The model generating improved prompts (GPT-5)
  • max_metric_calls: Total budget of metric evaluations during optimization (150)

Step 5: Understand the Results

Performance Improvement:
  • Baseline: 46.6% accuracy with simple prompt
  • Optimized: 56.6% accuracy with GEPA-optimized prompt
  • Gain: +10 percentage points from prompt optimization alone
The optimized prompt includes:
  • Domain-specific strategies for base conversion, palindromes, symmetric sums
  • Step-by-step problem-solving guidance
  • Common pitfall warnings and verification steps
  • Structured output format requirements

Step 6: Example of an Optimized AIME Prompt

Here’s an excerpt from the optimized prompt that GEPA discovered:
You will be given one math problem as plain text. Your job is to solve it 
correctly and return:

- reasoning: a concise, logically ordered solution that uses identities/structure 
  to avoid brute force, ends with a quick verification.
- answer: the final requested number/expression only (no extra words).

Domain-specific strategies:

1) Base-conversion/digit rearrangement:
   - Translate positional notation correctly
   - Enforce digit ranges strictly
   - Use modular constraints to prune (e.g., mod 9 often collapses coefficients)
   - Solve within digit bounds and verify numerically

2) Palindromes across bases:
   - Bound the base length by magnitude
   - Characterize palindromes algebraically
   - For "greatest", check candidates in descending order

3) Symmetric sums with fixed constraints:
   - Use identities to compress expressions
   - Convert relations among ab+bc+ca and abc
   - Count ordered solutions carefully

4) Intersecting families of subsets:
   - Empty set cannot be included
   - Complement pairs cannot both be present
   - Use size-based pigeonhole arguments

Quality checks:
- Verify digit/base constraints numerically
- For extremal problems, provide both bound and construction
- For counting, handle ordered vs unordered cases
- Justify optimality structurally

Step 7: Test on AIME 2025

Evaluate the optimized prompt on held-out test data:
from gepa.adapters.default_adapter import DefaultAdapter

# Create adapter for evaluation
adapter = DefaultAdapter(
    task_lm="openai/gpt-4.1-mini",
    metric=lambda pred, ref: int(pred.strip() == ref.strip())
)

# Test on AIME 2025
test_result = adapter.evaluate(
    batch=testset,
    candidate=result.best_candidate,
    capture_traces=False
)

test_accuracy = sum(test_result.scores) / len(test_result.scores)
print(f"AIME 2025 Test Accuracy: {test_accuracy:.2%}")

Key Takeaways

Significant Gains

10 percentage point improvement on AIME 2025 from prompt optimization alone, without fine-tuning or architectural changes.

Domain Knowledge

GEPA automatically discovers problem-solving strategies, common pitfalls, and verification steps specific to mathematical reasoning.

Generalization

The optimized prompt generalizes to unseen AIME 2025 problems after training on 2022-2024 data.

Efficient Search

Achieves strong results with a budget of just 150 metric calls, far fewer than the 5,000-25,000+ rollouts typical of RL-based methods.

Advanced Usage

Use with DSPy

For more complex AI pipelines, integrate GEPA through DSPy:
import dspy

optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)
optimized_program = optimizer.compile(
    student=MyProgram(), 
    trainset=trainset, 
    valset=valset
)
See the DSPy GEPA tutorials for executable notebooks.

Custom Metrics

Define custom evaluation metrics for your domain:
def math_metric(prediction, ground_truth, trace=None):
    """Custom metric that handles multiple answer formats."""
    # extract_answer and are_numerically_equivalent are user-supplied helpers
    pred_answer = extract_answer(prediction)
    true_answer = extract_answer(ground_truth)
    
    # Exact match
    if pred_answer == true_answer:
        return 1.0
    
    # Numerical equivalence
    if are_numerically_equivalent(pred_answer, true_answer):
        return 0.9
    
    return 0.0
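The metric above leaves its two helpers undefined. One possible implementation, assuming answers use the ### <answer> marker and may be plain numbers (both function bodies are our own sketch, not a GEPA API):

```python
import re

def extract_answer(text):
    """Pull the text after the last '###' marker; fall back to the whole string."""
    matches = re.findall(r"###\s*(.+)", text)
    return (matches[-1] if matches else text).strip()

def are_numerically_equivalent(a, b, tol=1e-9):
    """True if both strings parse as numbers and agree within tol."""
    try:
        return abs(float(a) - float(b)) <= tol
    except ValueError:
        return False

print(extract_answer("Thus the result.\n### 42"))   # → 42
print(are_numerically_equivalent("42", "42.0"))     # → True
```

The numerical check catches cases like "42" vs "42.0" that exact string matching would score as failures.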

Further Reading

GEPA Paper

Read the full research paper on reflective prompt evolution

DSPy Tutorials

Complete AIME tutorial with executable notebooks

Simple Prompt Tutorial

Learn basic prompt optimization concepts

Agent Architecture

Optimize entire agent systems, not just prompts

Next Steps

  • Try optimizing prompts for other mathematical benchmarks (MATH, GSM8K)
  • Experiment with different reflection models and budgets
  • Combine with other optimization techniques for even better results
  • Explore the full DSPy Program adapter for optimizing entire reasoning chains
