## Run the benchmark
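The command block for this step was not preserved, so here is a sketch of what an invocation might look like. The subcommand and every flag name except `-n` are assumptions; check the CLI reference for the real interface.

```shell
# Hypothetical invocation -- subcommand and flag names other than -n are
# assumptions, placeholders in angle brackets are yours to fill in.
context-bench run --dataset <dataset-name> --system <your-proxy> -n 50
```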
Point context-bench at your proxy and choose a dataset. The `-n 50` flag limits evaluation to 50 examples for a quick smoke test; remove it to evaluate the full dataset.

## Read the output
You’ll see a markdown table with aggregated metrics:
| Column | What it means |
|---|---|
| `mean_score` | Average F1 score across all examples (0–1) |
| `pass_rate` | Fraction of examples scoring above the threshold (default: 0.7) |
| `compression_ratio` | `1 - (output_tokens / input_tokens)`; positive values mean tokens were saved |
| `cost_of_pass` | Tokens spent per successful completion |
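The aggregation behind this table can be sketched in a few lines of Python. The field names below are illustrative, not the tool's actual schema, and `cost_of_pass` here counts total tokens per passing example, which is one plausible reading of the definition above.

```python
def summarize(examples, threshold=0.7):
    """Aggregate per-example results into the metrics shown in the table.

    Each example is a dict with illustrative keys:
    'score' (F1, 0-1), 'input_tokens', and 'output_tokens'.
    """
    n = len(examples)
    mean_score = sum(e["score"] for e in examples) / n
    passed = sum(1 for e in examples if e["score"] > threshold)
    total_in = sum(e["input_tokens"] for e in examples)
    total_out = sum(e["output_tokens"] for e in examples)
    return {
        "mean_score": mean_score,
        "pass_rate": passed / n,
        # Positive when the system emitted fewer tokens than it received.
        "compression_ratio": 1 - (total_out / total_in),
        # Total tokens spent per passing example (infinite if nothing passed).
        "cost_of_pass": (total_in + total_out) / passed if passed else float("inf"),
    }
```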
## Compare two proxies
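The original command block here was also lost, so the following is only a guess at its shape, assuming the CLI accepts the system argument more than once. None of these flags are documented options.

```shell
# Hypothetical side-by-side run -- every flag name here is an assumption.
context-bench run --dataset <dataset-name> --system <proxy-a> --system <proxy-b>
```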
Run both systems in a single command to get a side-by-side comparison.

## What’s next
- **CLI reference**: all CLI flags and subcommands
- **Core concepts**: how the Dataset → System → Evaluator → Metric pipeline works
- **Built-in datasets**: 42 datasets ready to use
- **Custom systems**: benchmark any system, not just proxies
