If your system uses a Python SDK instead of an OpenAI-compatible proxy, wrap it in a class that implements the System protocol. You don’t subclass anything — just implement two things: a name property and a process() method.

The System protocol

from typing import Any, Protocol

class System(Protocol):
    """Anything that transforms context."""

    @property
    def name(self) -> str: ...

    def process(self, example: dict[str, Any]) -> dict[str, Any]: ...
process() receives an example dict and must return a dict with:
  • All original keys preserved (use {**example, ...} to keep them)
  • A "response" key containing the LLM’s output string
  • Optionally a modified "context" key if your system compresses or rewrites the context
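Putting those requirements together, a minimal pass-through system that satisfies the protocol might look like the sketch below. EchoSystem is illustrative, not part of context_bench; the "LLM" here simply echoes the context back.

```python
from typing import Any

class EchoSystem:
    """Minimal System: returns the context unchanged as the response."""

    name = "echo"

    def process(self, example: dict[str, Any]) -> dict[str, Any]:
        # Preserve every original key and add the required "response" key.
        # A real system would call an LLM here instead of echoing.
        return {**example, "response": example["context"]}

out = EchoSystem().process({"context": "hello", "answer": "hi"})
```

Because it neither subclasses nor imports anything from context_bench, the same class works anywhere the System protocol is accepted.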

Simple example: string truncator

A truncator that trims context to max_chars characters before sending it to an LLM:
from context_bench import evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore

class TruncateSystem:
    name = "truncate-4000"

    def __init__(self, max_chars: int = 4000):
        self.max_chars = max_chars

    def process(self, example: dict) -> dict:
        truncated = example["context"][: self.max_chars]
        # call_your_llm is a placeholder for your own LLM call
        response = call_your_llm(truncated, example.get("question", ""))
        return {**example, "context": truncated, "response": response}


result = evaluate(
    systems=[TruncateSystem()],
    dataset=my_dataset,
    evaluators=[AnswerQuality()],
    metrics=[MeanScore(score_field="f1")],
)
print(result.summary)
Use {**example, "context": modified, "response": output} to preserve all original keys. Evaluators like AnswerQuality look for "answer" in the original example and "response" in the processed output.
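The merge pattern is ordinary dict unpacking: keys listed after **example win, so "context" is overridden, "response" is added, and everything else (including the "answer" an evaluator needs) survives untouched:

```python
example = {"context": "long document text", "question": "q?", "answer": "a"}

# Later keys override earlier ones; unlisted keys pass through unchanged.
processed = {**example, "context": "long", "response": "model output"}
```

Here processed["answer"] is still "a", which is exactly what AnswerQuality reads from the original example.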

Wrapping a third-party SDK

Some systems (like Compresr) expose a Python SDK instead of a proxy. Wrap them in the same pattern:
from compresr import CompressionClient

class CompresrSystem:
    name = "compresr"

    def __init__(self, api_key: str):
        self.client = CompressionClient(api_key=api_key)

    def process(self, example: dict) -> dict:
        compressed = self.client.generate(
            context=example["context"],
            question=example.get("question", ""),
        )
        return {**example, "context": compressed}
This example modifies "context" but does not add a "response" key. That is valid if you want to measure compression quality separately; the runner still scores whatever keys your evaluators read.
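Because both the original and processed examples are plain dicts, you can compute compression statistics directly, without any context_bench API. A sketch (compression_ratio is a hypothetical helper, not a library function):

```python
def compression_ratio(original: dict, processed: dict) -> float:
    """Chars in compressed context divided by chars in original context."""
    return len(processed["context"]) / len(original["context"])

orig = {"context": "x" * 1000, "question": "q"}
proc = {**orig, "context": "x" * 250}
ratio = compression_ratio(orig, proc)  # 0.25
```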

Full evaluate() call

Pass your custom system alongside any evaluators and metrics:
from context_bench import evaluate
from context_bench.evaluators import AnswerQuality, MathEquivalence
from context_bench.metrics import MeanScore, Latency, PerDatasetBreakdown

class MyCompressor:
    name = "my-compressor"

    def process(self, example: dict) -> dict:
        compressed = my_compress(example["context"])
        return {**example, "context": compressed, "response": compressed}


result = evaluate(
    systems=[MyCompressor()],
    dataset=my_data,
    evaluators=[AnswerQuality(), MathEquivalence()],
    metrics=[
        MeanScore(score_field="f1"),
        Latency(),
        PerDatasetBreakdown(score_field="f1"),
    ],
    max_workers=4,
    cache_dir=".cache/",
)

print(result.to_json())
df = result.to_dataframe()  # requires pandas

Multi-turn systems

For multi-turn datasets like MT-Bench, implement an optional process_conversation() method. The runner calls it automatically when the example has a "turns" list.
class MyMultiTurnSystem:
    name = "my-chatbot"

    def process(self, example: dict) -> dict:
        """Called for single-turn examples."""
        return {**example, "response": call_my_api(example["context"])}

    def process_conversation(self, turns: list[dict]) -> list[dict]:
        """Called for multi-turn examples (e.g., MT-Bench).

        Args:
            turns: List of user-turn dicts {"role": "user", "content": ...}.

        Returns:
            List of assistant-response dicts {"role": "assistant", "content": ...},
            one per user turn.
        """
        history = []
        responses = []
        for turn in turns:
            history.append(turn)
            reply = call_my_api_with_history(history)
            history.append({"role": "assistant", "content": reply})
            responses.append({"role": "assistant", "content": reply})
        return responses
Run MT-Bench with an LLM judge to score the conversation quality:
context-bench \
  --proxy http://localhost:7878 \
  --dataset mt-bench \
  --judge-url http://localhost:9090 \
  --score-field judge_score
If process_conversation() is not defined, the runner falls back to calling process() with the full turns list embedded in the example dict.
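If you rely on that fallback, process() itself can branch on the embedded turns. A sketch, assuming the runner puts the user turns under a "turns" key as described above; the model call is stubbed with an echo:

```python
class FallbackSystem:
    """Handles single- and multi-turn examples in one process() method."""

    name = "fallback"

    def process(self, example: dict) -> dict:
        if "turns" in example:
            # Multi-turn fallback: collapse the user turns into one prompt.
            prompt = "\n".join(t["content"] for t in example["turns"])
        else:
            prompt = example["context"]
        # Stubbed model call; replace with your real API.
        response = f"echo: {prompt}"
        return {**example, "response": response}
```

Defining process_conversation() is still preferable for judged benchmarks like MT-Bench, since it preserves per-turn responses instead of one merged reply.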
