DSPy Subcommand

The dspy subcommand runs a factorial sweep of DSPy optimizers across context-bench datasets. It tracks compile cost, inference scores, and task features, producing data useful for optimizer selection meta-learning.

Usage

context-bench dspy [--optimizer NAME] [--dataset NAME] [options]

Flags

--optimizer (string, default: "all")

DSPy optimizer to benchmark. Repeatable; omit to run all optimizers. Available optimizers:

LabeledFewShot
BootstrapFewShot
BootstrapFewShotWithRandomSearch
COPRO
MIPROv2
SIMBA
GEPA

context-bench dspy --optimizer MIPROv2 --optimizer SIMBA --dataset gsm8k

--dataset (string, default: "all compatible")

Dataset to benchmark on. Repeatable. When omitted, all datasets compatible with the selected optimizers are used.

--budget (string, default: "light")

Optimization budget tier. One of light, medium, or heavy. Repeatable; pass multiple tiers to sweep across them. Higher budget tiers allow optimizers more compilation iterations and candidates.

--seed (integer, default: "[42]")

Random seed for the optimizer. Repeatable; pass multiple seeds to evaluate variance across runs.

--seed 42 --seed 123 --seed 456
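One way to use multi-seed results is to summarize the spread of scores for a fixed optimizer, dataset, and budget. The sketch below is illustrative only: the scores are made-up placeholders, not benchmark results, and the `summarize` helper is hypothetical rather than part of context-bench.

```python
from statistics import mean, stdev

# Placeholder scores for one optimizer/dataset/budget, keyed by seed.
# These numbers are invented for illustration; they are not real results.
scores_by_seed = {42: 0.71, 123: 0.68, 456: 0.74}


def summarize(scores):
    """Return mean, sample standard deviation, and count across seeds."""
    values = list(scores.values())
    # stdev needs at least two data points; report 0.0 for a single seed.
    spread = stdev(values) if len(values) > 1 else 0.0
    return {"mean": mean(values), "stdev": spread, "n": len(values)}


summary = summarize(scores_by_seed)
print(summary)
```

A large standard deviation relative to the gap between two optimizers suggests that more seeds are needed before ranking them.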

--model (string, default: "claude-haiku-4-5-20251001")

Language model used for DSPy inference during both compilation and evaluation.

--prompt-model (string)

Separate model for instruction generation. Applies to optimizers that generate prompt instructions during compilation: MIPROv2, COPRO, and SIMBA. When omitted, --model is used for all steps.
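The fallback rule described above can be sketched as follows. This is a minimal illustration of the documented behavior, not the tool's actual code: `resolve_models`, `INSTRUCTION_OPTIMIZERS`, the key names, and the `some-larger-model` identifier are all hypothetical.

```python
# Optimizers the text lists as generating prompt instructions during compilation.
INSTRUCTION_OPTIMIZERS = {"MIPROv2", "COPRO", "SIMBA"}


def resolve_models(optimizer, model, prompt_model=None):
    """Pick the model for each step; fall back to `model` when no separate one applies."""
    use_separate = optimizer in INSTRUCTION_OPTIMIZERS and prompt_model is not None
    return {
        "task_model": model,
        "prompt_model": prompt_model if use_separate else model,
    }


# "some-larger-model" is a placeholder name, not a real model identifier.
print(resolve_models("MIPROv2", "claude-haiku-4-5-20251001", "some-larger-model"))
print(resolve_models("MIPROv2", "claude-haiku-4-5-20251001"))
```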

--cache-dir (string, default: ".dspy_cache")

Directory for compiled program cache. Compiled programs are stored here and reused on subsequent runs to avoid recompilation.
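The on-disk layout of the cache is not documented here. As a rough mental model only, a cache of this kind typically maps each run configuration to a deterministic path so that a later run with the same settings finds the earlier result. The `cache_path` function, the fields in the key, and the file naming below are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir, optimizer, dataset, budget, seed, model):
    """Map one run configuration to a deterministic file path (hypothetical layout)."""
    config = {
        "optimizer": optimizer,
        "dataset": dataset,
        "budget": budget,
        "seed": seed,
        "model": model,
    }
    # sort_keys makes the serialized form, and therefore the hash, stable.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return Path(cache_dir) / f"{optimizer}-{dataset}-{digest}.json"


first = cache_path(".dspy_cache", "MIPROv2", "gsm8k", "light", 42, "claude-haiku-4-5-20251001")
again = cache_path(".dspy_cache", "MIPROv2", "gsm8k", "light", 42, "claude-haiku-4-5-20251001")
other = cache_path(".dspy_cache", "MIPROv2", "gsm8k", "light", 123, "claude-haiku-4-5-20251001")
print(first == again, first == other)
```

Under this model, changing any field in the key (for example the seed) produces a different path and therefore a fresh compilation.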

--max-train (integer, default: "500")

Maximum training examples per dataset used during compilation.

--max-val (integer, default: "200")

Maximum validation examples per dataset used during compilation.

-n / --max-examples (integer, default: "all")

Maximum test examples per dataset used during evaluation. Useful for quick runs.

--cost-cap (number)

Abort the sweep if the estimated cost exceeds this amount in dollars. Prevents runaway spend on large sweeps.
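The exact point at which the cap is checked is not specified here. The sketch below assumes the simplest policy: add each run's estimated cost to a running total and stop once the total exceeds the cap. The function name, the per-run costs, and the return shape are hypothetical.

```python
def run_with_cost_cap(estimated_costs, cost_cap=None):
    """Walk the runs in order; stop when the cumulative estimate exceeds the cap.

    Returns (runs_completed, total_estimated_cost, aborted).
    """
    total = 0.0
    completed = 0
    for cost in estimated_costs:
        total += cost
        if cost_cap is not None and total > cost_cap:
            # Assumed policy: the run that crosses the cap is not counted.
            return completed, total, True
        completed += 1
    return completed, total, False


# Placeholder per-run estimates in dollars, invented for illustration.
costs = [0.40, 0.35, 0.50, 0.20]
print(run_with_cost_cap(costs, cost_cap=1.00))
print(run_with_cost_cap(costs))
```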

--compile-only (boolean, default: "false")

Compile optimizers but skip the evaluation step. Use this to pre-populate the cache before a full evaluation run.

--include-ablations (boolean, default: "false")

Include GEPA ablation runs that omit the feedback metric. Adds extra rows to the results for analysis.

--output (string, default: "table")

Output format. One of table or json.

HTML output is not available for the dspy subcommand. Use table or json.

Examples

context-bench dspy --optimizer MIPROv2 --dataset hotpotqa --budget light

How the sweep works

The sweep runs every combination of the selected optimizers, datasets, budgets, and seeds. For each combination:

1. The optimizer compiles a DSPy program using the training split.
2. The compiled program is evaluated on the test split.
3. Results are emitted in the format selected by --output, together with compile metadata and inference scores.
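To gauge how large a sweep will be before launching it, the combinations can be enumerated as a Cartesian product of the four repeatable flags. The sketch below is a standalone illustration; `sweep_grid` is a hypothetical helper, and the selections are example values rather than defaults.

```python
from itertools import product

# Example selections, mirroring what repeated flags would supply.
optimizers = ["MIPROv2", "SIMBA"]
datasets = ["gsm8k", "hotpotqa"]
budgets = ["light"]
seeds = [42, 123, 456]


def sweep_grid(optimizers, datasets, budgets, seeds):
    """Return one dict per (optimizer, dataset, budget, seed) combination."""
    return [
        {"optimizer": o, "dataset": d, "budget": b, "seed": s}
        for o, d, b, s in product(optimizers, datasets, budgets, seeds)
    ]


grid = sweep_grid(optimizers, datasets, budgets, seeds)
print(len(grid))  # 2 optimizers * 2 datasets * 1 budget * 3 seeds = 12
print(grid[0])
```

Because the run count is the product of the four list lengths, adding a second budget tier or a few more seeds multiplies both compile and evaluation cost.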

Completed, failed, and skipped runs are reported in the final summary line:

Sweep complete: 12 completed, 0 failed, 0 skipped
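A summary in this shape amounts to a tally of per-run statuses. The sketch below shows one way to produce such a line; the `summary_line` helper and the status strings are assumptions, not the tool's internals.

```python
from collections import Counter


def summary_line(statuses):
    """Count run statuses and format them like the summary shown above."""
    counts = Counter(statuses)  # missing keys count as 0
    return (
        f"Sweep complete: {counts['completed']} completed, "
        f"{counts['failed']} failed, {counts['skipped']} skipped"
    )


print(summary_line(["completed"] * 12))
```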

Use --compile-only on a first pass to warm the cache, then run the full evaluation separately. This is useful when compilation is expensive and you want to reuse compiled programs across multiple evaluation configurations.
