The dspy subcommand runs a factorial sweep of DSPy optimizers across context-bench datasets. It records compile cost, inference scores, and task features, producing data that can be used for meta-learning on optimizer selection.
Usage
Flags
DSPy optimizer to benchmark. Repeatable. Omit to run all optimizers. Available optimizers:
- LabeledFewShot
- BootstrapFewShot
- BootstrapFewShotWithRandomSearch
- COPRO
- MIPROv2
- SIMBA
- GEPA
Dataset to benchmark on. Repeatable. When omitted, all datasets compatible with the selected optimizers are used.
Optimization budget tier. Repeatable: pass multiple tiers to sweep across them. One of light, medium, or heavy. Higher budget tiers allow optimizers more compilation iterations and candidates.
Random seed for the optimizer. Repeatable: pass multiple seeds to evaluate variance across runs.
Language model used for DSPy inference during both compilation and evaluation.
Separate model for instruction generation. Applies to optimizers that generate prompt instructions during compilation: MIPROv2, COPRO, and SIMBA. When omitted, --model is used for all steps.
Directory for compiled program cache. Compiled programs are stored here and reused on subsequent runs to avoid recompilation.
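
One way to picture the compiled program cache is as a mapping from a run configuration to a file in the cache directory. The sketch below is only an illustration under assumptions: the function name `cache_path`, the fields that make up the key, and the file naming are all hypothetical and are not taken from the tool itself.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir, optimizer, dataset, budget, seed, model):
    """Return a deterministic file path for one compiled program.

    Hypothetical scheme: hash every setting that could change the
    compiled program, so identical settings map to the same file.
    """
    key = json.dumps(
        {
            "optimizer": optimizer,
            "dataset": dataset,
            "budget": budget,
            "seed": seed,
            "model": model,
        },
        sort_keys=True,  # stable key order gives a stable hash
    )
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{optimizer}-{digest}.json"


# Placeholder dataset and model names, chosen only for the example.
first = cache_path("cache", "MIPROv2", "dataset_a", "light", 0, "model_x")
again = cache_path("cache", "MIPROv2", "dataset_a", "light", 0, "model_x")
other = cache_path("cache", "MIPROv2", "dataset_a", "light", 1, "model_x")
```

Under this scheme a rerun with the same settings finds the existing file and can skip compilation, while changing any one setting, such as the seed, produces a new entry.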
Maximum training examples per dataset used during compilation.
Maximum validation examples per dataset used during compilation.
Maximum test examples per dataset used during evaluation. Useful for quick runs.
Abort the sweep if the estimated cost exceeds this amount in dollars. Prevents runaway spend on large sweeps.
Compile optimizers but skip the evaluation step. Use this to pre-populate the cache before a full evaluation run.
Include GEPA ablation runs that omit the feedback metric. Adds extra rows to the results for analysis.
Output format. One of table or json. HTML output is not available for the dspy subcommand.
Examples
How the sweep works
The sweep runs every combination of the selected optimizers, datasets, budgets, and seeds. For each combination:
- The optimizer compiles a DSPy program using the training split.
- The compiled program is evaluated on the test split.
- Results are written to --output with compile metadata and inference scores.
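
The factorial structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the dataset names are placeholders, and `compile_and_evaluate` is a hypothetical stand-in for the compile and evaluate steps.

```python
from itertools import product

# Selections for the sweep. Dataset names are placeholders.
optimizers = ["BootstrapFewShot", "MIPROv2"]
datasets = ["dataset_a", "dataset_b"]
budgets = ["light", "medium"]
seeds = [0, 1]


def compile_and_evaluate(optimizer, dataset, budget, seed):
    """Hypothetical stand-in for one sweep cell.

    A real run would compile on the training split and score on the
    test split; here the result fields are simply left empty.
    """
    return {
        "optimizer": optimizer,
        "dataset": dataset,
        "budget": budget,
        "seed": seed,
        "compile_cost": None,  # would hold compile metadata
        "score": None,         # would hold the inference score
    }


# One run per element of the Cartesian product of all selections.
runs = [
    compile_and_evaluate(optimizer, dataset, budget, seed)
    for optimizer, dataset, budget, seed in product(
        optimizers, datasets, budgets, seeds
    )
]

print(len(runs))  # 2 * 2 * 2 * 2 = 16
```

The run count is the product of the number of values chosen for each dimension, so adding a seed or a budget tier multiplies the total and with it the cost of the sweep.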
