The dspy subcommand runs a factorial sweep of DSPy optimizers across context-bench datasets. It tracks compile cost, inference scores, and task features, producing data useful for optimizer selection meta-learning.

Usage

context-bench dspy --optimizer NAME --dataset NAME [options]

Flags

--optimizer
string
default:"all"
DSPy optimizer to benchmark. Repeatable; omit to run all optimizers. Available optimizers:
  • LabeledFewShot
  • BootstrapFewShot
  • BootstrapFewShotWithRandomSearch
  • COPRO
  • MIPROv2
  • SIMBA
  • GEPA
context-bench dspy --optimizer MIPROv2 --optimizer SIMBA --dataset gsm8k
--dataset
string
default:"all compatible"
Dataset to benchmark on. Repeatable. When omitted, all datasets compatible with the selected optimizers are used.
--budget
string
default:"light"
Optimization budget tier: one of light, medium, or heavy. Repeatable; pass multiple tiers to sweep across them. Higher budget tiers allow optimizers more compilation iterations and candidates.
--seed
integer
default:"[42]"
Random seed for the optimizer. Repeatable — pass multiple seeds to evaluate variance across runs.
--seed 42 --seed 123 --seed 456
--model
string
default:"claude-haiku-4-5-20251001"
Language model used for DSPy inference during both compilation and evaluation.
--prompt-model
string
Separate model for instruction generation. Applies to optimizers that generate prompt instructions during compilation: MIPROv2, COPRO, and SIMBA. When omitted, --model is used for all steps.
--cache-dir
string
default:".dspy_cache"
Directory for compiled program cache. Compiled programs are stored here and reused on subsequent runs to avoid recompilation.
--max-train
integer
default:"500"
Maximum training examples per dataset used during compilation.
--max-val
integer
default:"200"
Maximum validation examples per dataset used during compilation.
-n / --max-examples
integer
default:"all"
Maximum test examples per dataset used during evaluation. Useful for quick runs.
--cost-cap
number
Abort the sweep if the estimated cost exceeds this amount in dollars. Prevents runaway spend on large sweeps.
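The idea behind a cost cap can be sketched as a running-total guard. This is only an illustration under stated assumptions: the names `CostCapExceeded` and `run_with_cost_cap` are hypothetical, and how context-bench actually estimates cost, or at which point in a run it checks the cap, is not specified here.

```python
class CostCapExceeded(RuntimeError):
    """Hypothetical error raised when the estimated spend would pass the cap."""


def run_with_cost_cap(estimated_costs, cost_cap=None):
    """Walk per-run cost estimates (in dollars) and stop once the cap would be exceeded.

    Assumption: the check happens before each run starts. The real tool may
    check at a different point or estimate cost differently.
    """
    spent = 0.0
    completed = 0
    for cost in estimated_costs:
        if cost_cap is not None and spent + cost > cost_cap:
            raise CostCapExceeded(
                f"estimated ${spent + cost:.2f} exceeds cap ${cost_cap:.2f}"
            )
        spent += cost
        completed += 1
    return completed, spent


# Two runs estimated at $1 and $2 fit under a $5 cap.
print(run_with_cost_cap([1.0, 2.0], cost_cap=5.0))
```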
--compile-only
boolean
default:"false"
Compile optimizers but skip the evaluation step. Use this to pre-populate the cache before a full evaluation run.
--include-ablations
boolean
default:"false"
Include GEPA ablation runs that omit the feedback metric. Adds extra rows to the results for analysis.
--output
string
default:"table"
Output format. One of table or json.
HTML output is not available for the dspy subcommand. Use table or json.

Examples

context-bench dspy --optimizer MIPROv2 --dataset hotpotqa --budget light

How the sweep works

The sweep runs every combination of the selected optimizers, datasets, budgets, and seeds. For each combination:
  1. The optimizer compiles a DSPy program using the training split.
  2. The compiled program is evaluated on the test split.
  3. Results are reported in the format selected by --output, together with compile metadata and inference scores.
Completed, failed, and skipped runs are reported in the final summary line:
Sweep complete: 12 completed, 0 failed, 0 skipped
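The grid of combinations described above can be sketched in a few lines of Python. This is a minimal illustration of a factorial sweep, not context-bench's actual implementation; the function name `sweep_grid` and the tuple layout are assumptions made for this example.

```python
from itertools import product


def sweep_grid(optimizers, datasets, budgets, seeds):
    """Return every (optimizer, dataset, budget, seed) combination.

    Hypothetical helper for illustration; not part of context-bench.
    """
    return list(product(optimizers, datasets, budgets, seeds))


runs = sweep_grid(
    optimizers=["MIPROv2", "SIMBA"],
    datasets=["gsm8k"],
    budgets=["light", "medium", "heavy"],
    seeds=[42, 123],
)

# 2 optimizers x 1 dataset x 3 budgets x 2 seeds
print(len(runs))  # 12
```

Because the run count is the product of the four list lengths, each extra seed or budget tier multiplies the total. It is worth working out that number before launching a large sweep.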
Use --compile-only on a first pass to warm the cache, then run the full evaluation separately. This is useful when compilation is expensive and you want to reuse compiled programs across multiple evaluation configurations.
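A two-pass workflow along these lines could be scripted as follows. The sketch only builds the two command lines from flags documented on this page; the helper name `two_pass_commands` is hypothetical, and it assumes that --compile-only is passed as a bare switch with no value.

```python
import shlex


def two_pass_commands(optimizer, dataset, cache_dir=".dspy_cache", max_examples=None):
    """Build argv lists for a compile-only pass and a later evaluation pass.

    Hypothetical helper; both passes share --cache-dir so the second pass
    can reuse the programs compiled by the first.
    """
    base = [
        "context-bench", "dspy",
        "--optimizer", optimizer,
        "--dataset", dataset,
        "--cache-dir", cache_dir,
    ]
    # Assumption: --compile-only is a bare boolean switch.
    compile_pass = base + ["--compile-only"]
    eval_pass = list(base)
    if max_examples is not None:
        eval_pass += ["-n", str(max_examples)]
    return compile_pass, eval_pass


compile_cmd, eval_cmd = two_pass_commands("MIPROv2", "hotpotqa", max_examples=50)
print(shlex.join(compile_cmd))
print(shlex.join(eval_cmd))
```

To execute the passes rather than print them, each list can be handed to `subprocess.run(cmd, check=True)` in order.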
