Running two or more systems in one invocation lets you compare quality, compression, and cost on identical data. context-bench automatically adds Pareto ranking when multiple systems are present and per-dataset breakdowns when multiple datasets are requested.
## Two systems

Pass `--proxy` and `--name` pairs. Names are optional — if omitted, context-bench derives them from the URL hostname.

```bash
context-bench \
  --proxy http://localhost:7878 --name kompact \
  --proxy http://localhost:8787 --name headroom \
  --dataset hotpotqa -n 50
```
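As an illustration, "derives them from the URL hostname" could be read as the following (an assumption about the rule; the CLI's exact derivation may differ):

```python
from urllib.parse import urlparse

def derive_name(proxy_url: str) -> str:
    # Hypothetical sketch: fall back to the URL's hostname
    # when no --name is given.
    return urlparse(proxy_url).hostname

derive_name("http://kompact.internal:7878")  # "kompact.internal"
```

Note that both proxies in the example above share the hostname localhost, so explicit `--name` flags keep the two systems distinguishable.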
## Three systems

Add a baseline (e.g. a passthrough proxy or the raw model) alongside your candidates:

```bash
context-bench \
  --proxy http://localhost:9091 --name baseline \
  --proxy http://localhost:7878 --name kompact \
  --proxy http://localhost:7879 --name headroom \
  --dataset bfcl --model haiku --score-field contains
```
Output:
```
# Evaluation Results

| System   | mean_score | pass_rate | compression_ratio | cost_of_pass |
|----------|------------|-----------|-------------------|--------------|
| Baseline | 0.2930     | 0.2930    | -0.3264           | 4,291        |
| Kompact  | 0.3640     | 0.3640    | -0.1345           | 2,447        |
| Headroom | 0.3140     | 0.3140    | -0.1793           | 3,815        |

*1,431 examples evaluated*
```
## Multiple datasets

Repeat `--dataset` to evaluate across several benchmarks in a single run. A per-dataset breakdown table is added automatically.

```bash
context-bench \
  --proxy http://localhost:7878 --name kompact \
  --proxy http://localhost:8787 --name baseline \
  --dataset mmlu --dataset arc-challenge --dataset hellaswag \
  --score-field mc_accuracy -n 100
```
## Pareto frontier analysis
When two or more systems are compared, context-bench computes a pareto_rank for each system. The Pareto frontier is the set of systems not dominated by any other: a system is dominated when some other system scores at least as well on quality (mean_score), costs no more (cost_of_pass), and is strictly better on at least one of the two. A rank of 1 means the system is on the frontier.
In the example above, Kompact has the highest mean_score (0.3640) and the lowest cost_of_pass (2,447) — it dominates both other systems and sits alone on the Pareto frontier.
Pareto ranking is computed after evaluation completes. It requires at least two systems and uses mean_score vs cost_of_pass by default.
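The ranking logic can be sketched outside the library (a minimal illustration, assuming higher mean_score and lower cost_of_pass are better; this is not context-bench's actual implementation):

```python
def pareto_rank(points):
    """Assign rank 1 to non-dominated systems, then peel off
    successive frontiers: rank 2, rank 3, and so on."""
    remaining = dict(points)  # name -> (mean_score, cost_of_pass)
    ranks, rank = {}, 1
    while remaining:
        frontier = [
            name
            for name, (score, cost) in remaining.items()
            if not any(
                s2 >= score and c2 <= cost and (s2 > score or c2 < cost)
                for other, (s2, c2) in remaining.items()
                if other != name
            )
        ]
        for name in frontier:
            ranks[name] = rank
            remaining.pop(name)
        rank += 1
    return ranks

ranks = pareto_rank({
    "baseline": (0.2930, 4291),
    "kompact": (0.3640, 2447),
    "headroom": (0.3140, 3815),
})
```

Applied to the three-system table above, kompact lands alone on rank 1 (it dominates both others), headroom on rank 2, and baseline on rank 3.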
## Python API

Pass a list of systems to `evaluate()`. Filter the result to one system with `.filter()`.

```python
from context_bench import OpenAIProxy, evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore, PassRate, CompressionRatio, CostOfPass

kompact = OpenAIProxy("http://localhost:7878", model="gpt-4", name="kompact")
headroom = OpenAIProxy("http://localhost:8787", model="gpt-4", name="headroom")

result = evaluate(
    systems=[kompact, headroom],
    dataset=my_dataset,
    evaluators=[AnswerQuality()],
    metrics=[
        MeanScore(score_field="f1"),
        PassRate(score_field="f1"),
        CompressionRatio(),
        CostOfPass(score_field="f1"),
    ],
)

# Full summary
print(result.summary)

# Filter to one system
headroom_rows = result.filter(system="headroom")
```
## Per-dataset breakdown

When multiple datasets are loaded, `PerDatasetBreakdown` is added automatically by the CLI. In Python, add it explicitly:

```python
from context_bench.metrics import PerDatasetBreakdown

result = evaluate(
    systems=[kompact, headroom],
    dataset=combined_dataset,  # tagged with "dataset" key per example
    evaluators=[AnswerQuality()],
    metrics=[
        MeanScore(score_field="f1"),
        PerDatasetBreakdown(score_field="f1"),
    ],
)
```
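The shape of combined_dataset here is an assumption: a flat list of example dicts, each carrying a "dataset" key the breakdown can group on. Building one from per-benchmark lists might look like this (tag_and_combine is a hypothetical helper, not part of context-bench):

```python
def tag_and_combine(**datasets):
    """Merge per-benchmark example lists into one flat list,
    stamping each example with the dataset it came from."""
    combined = []
    for name, examples in datasets.items():
        combined.extend({**ex, "dataset": name} for ex in examples)
    return combined

combined_dataset = tag_and_combine(
    mmlu=[{"question": "2 + 2 = ?", "answer": "4"}],
    gsm8k=[{"question": "A train covers 120 km in 2 hours. Speed?", "answer": "60"}],
)
```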
## Choosing the right score field

Different task types use different score fields. Pass `--score-field` (CLI) or `score_field=` (Python) to select the right one.

| Task type | Recommended score_field | Datasets |
|---|---|---|
| Open-domain QA | f1 | hotpotqa, triviaqa, musique, narrativeqa |
| Multiple choice | mc_accuracy | mmlu, arc-challenge, hellaswag, winogrande, gpqa |
| Code generation | pass_at_1 | humaneval, mbpp |
| Math | math_equiv | math, gsm8k, mgsm |
| Summarization | rouge_l_f1 | multi-news, dialogsum, govreport |
| Instruction following | ifeval_strict | ifeval |
| LLM judge | judge_score | alpaca-eval, mt-bench |
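For scripting, the table can be flattened into a lookup (a convenience sketch that simply mirrors the recommendations above; score_field_for is a hypothetical helper):

```python
# Recommended score field per dataset, mirroring the table above.
SCORE_FIELDS = {
    **dict.fromkeys(["hotpotqa", "triviaqa", "musique", "narrativeqa"], "f1"),
    **dict.fromkeys(["mmlu", "arc-challenge", "hellaswag", "winogrande", "gpqa"],
                    "mc_accuracy"),
    **dict.fromkeys(["humaneval", "mbpp"], "pass_at_1"),
    **dict.fromkeys(["math", "gsm8k", "mgsm"], "math_equiv"),
    **dict.fromkeys(["multi-news", "dialogsum", "govreport"], "rouge_l_f1"),
    "ifeval": "ifeval_strict",
    **dict.fromkeys(["alpaca-eval", "mt-bench"], "judge_score"),
}

def score_field_for(dataset: str) -> str:
    """Look up the recommended score field for a dataset name."""
    return SCORE_FIELDS[dataset]
```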