A system is anything that transforms LLM context. Prompt compressors, memory managers, RAG rerankers, proxy servers — if it modifies what goes into a context window, it is a system.

The System protocol

from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class System(Protocol):
    """Anything that transforms context."""

    @property
    def name(self) -> str: ...

    def process(self, example: dict[str, Any]) -> dict[str, Any]: ...
The @runtime_checkable decorator means you can use isinstance(obj, System) to verify compliance at runtime.
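For example, a class that merely has the right members passes the check — no inheritance needed. (The protocol is redefined locally here so the snippet is self-contained; in practice you would import it from context-bench.)

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class System(Protocol):
    """Anything that transforms context."""

    @property
    def name(self) -> str: ...

    def process(self, example: dict[str, Any]) -> dict[str, Any]: ...

class Identity:
    """Does not inherit from System -- structural typing is enough."""
    name = "identity"

    def process(self, example):
        return dict(example)

assert isinstance(Identity(), System)
```

Note that `runtime_checkable` only verifies that the members exist; it does not check their signatures or return types.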

Built-in systems

OpenAIProxy

Forwards requests through any OpenAI-compatible proxy endpoint. Used by the CLI’s --proxy flag.
from context_bench import OpenAIProxy

system = OpenAIProxy(
    base_url="http://localhost:7878",
    model="gpt-4",
    api_key="sk-...",             # or set OPENAI_API_KEY env var
    system_prompt="Be concise.",  # prepended as system message
    extra_body={"temperature": 0}, # any additional request params
    name="kompact",
)

ClaudeCLI

Calls the Claude CLI directly. Useful for testing Claude without an API key or proxy.

Implementing a custom system

Any class with a name property and a process() method satisfies the protocol; no imports from context-bench are required.
Required:
  • name — a string property identifying the system in results and reports
  • process(example: dict) -> dict — transform the example and return a new dict
Convention:
  • Add or overwrite the "response" key in the returned dict with the system’s output
  • Pass all other keys through unchanged using {**example, ...}

Simplest possible system

A string truncator that limits context to 500 characters:
class TruncateSystem:
    name = "truncate-500"

    def process(self, example):
        truncated = example["context"][:500]
        return {**example, "context": truncated, "response": truncated}
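Calling it on a sample example shows both conventions in action — the `{**example, ...}` spread passes unrelated keys (here a hypothetical "id") through untouched, while "context" and "response" are overwritten. The class is repeated so the snippet runs on its own:

```python
class TruncateSystem:
    name = "truncate-500"

    def process(self, example):
        truncated = example["context"][:500]
        return {**example, "context": truncated, "response": truncated}

out = TruncateSystem().process({"id": 1, "context": "x" * 1000})
# "id" passes through untouched; "context" and "response" are capped at 500 chars
```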

Wrapping a third-party SDK

Systems that use a Python SDK instead of a proxy endpoint wrap the SDK call in process():
from compresr import CompressionClient

class CompresrSystem:
    name = "compresr"

    def __init__(self, api_key):
        self.client = CompressionClient(api_key=api_key)

    def process(self, example):
        compressed = self.client.generate(
            context=example["context"],
            question=example.get("question", ""),
        )
        return {**example, "context": compressed}
Then pass it to evaluate() like any other system:
from context_bench import evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore, CompressionRatio

result = evaluate(
    systems=[CompresrSystem(api_key="...")],
    dataset=my_dataset,
    evaluators=[AnswerQuality()],
    metrics=[MeanScore(score_field="f1"), CompressionRatio()],
)

Multi-turn systems

For multi-turn datasets (such as MT-Bench), the runner checks whether the system has a process_conversation() method. If present, it is called with the list of conversation turns instead of process().
class MyMultiTurnSystem:
    name = "my-chatbot"

    def process(self, example):
        # Fallback for single-turn examples
        return {**example, "response": call_my_api(example["context"])}

    def process_conversation(self, turns):
        """Called for multi_turn examples (e.g., MT-Bench).

        Args:
            turns: list of {"role": "user"|"assistant", "content": str} dicts

        Returns:
            list of {"role": "assistant", "content": str} dicts
        """
        history = []
        responses = []
        for turn in turns:
            history.append(turn)
            reply = call_my_api_with_history(history)
            history.append({"role": "assistant", "content": reply})
            responses.append({"role": "assistant", "content": reply})
        return responses
The runner takes the final assistant response from process_conversation() and stores it as "response" in the EvalRow.
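A simplified sketch of that dispatch logic — illustrative only, not the actual runner implementation, and the "turns" key name is an assumption:

```python
def run_example(system, example):
    """Prefer process_conversation() when the example is multi-turn."""
    turns = example.get("turns")
    if turns and hasattr(system, "process_conversation"):
        replies = system.process_conversation(turns)
        # The final assistant reply becomes the "response" field
        return {**example, "response": replies[-1]["content"]}
    return system.process(example)

class Echo:
    name = "echo"

    def process(self, example):
        return {**example, "response": example["context"]}

    def process_conversation(self, turns):
        return [{"role": "assistant", "content": t["content"].upper()}
                for t in turns if t["role"] == "user"]

row = run_example(Echo(), {"turns": [{"role": "user", "content": "hi"}]})
# row["response"] holds the final assistant reply
```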

The MemorySystem protocol

For stateful memory evaluation (conversation recall, long-term memory), implement the MemorySystem protocol instead:
@runtime_checkable
class MemorySystem(Protocol):
    """A stateful memory system evaluated over conversation histories."""

    @property
    def name(self) -> str: ...

    def reset(self) -> None:
        """Clear all stored memory. Called once before each conversation."""
        ...

    def ingest(self, turns: list[dict[str, Any]]) -> None:
        """Store a conversation history.

        Args:
            turns: List of {"role": "user"|"assistant", "content": str} dicts
                in chronological order.
        """
        ...

    def query(self, question: str) -> str:
        """Answer a question using stored memory.

        Returns:
            The system's answer string.
        """
        ...
The runner calls reset() between conversations, ingest() to load the conversation history, then query() for each QA pair.
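A minimal in-process implementation that satisfies the protocol — naive keyword overlap over stored turns, purely to illustrate the reset/ingest/query call sequence, not a serious memory system:

```python
class ListMemory:
    """Stores turns in a plain list; query() returns the best keyword match."""
    name = "list-memory"

    def __init__(self):
        self.turns = []

    def reset(self):
        self.turns = []

    def ingest(self, turns):
        self.turns.extend(turns)

    def query(self, question):
        words = set(question.lower().split())
        # Pick the stored turn sharing the most words with the question
        best = max(
            self.turns,
            key=lambda t: len(words & set(t["content"].lower().split())),
            default=None,
        )
        return best["content"] if best else ""

mem = ListMemory()
mem.reset()
mem.ingest([
    {"role": "user", "content": "My favorite color is teal."},
    {"role": "assistant", "content": "Noted!"},
])
answer = mem.query("What is my favorite color?")
```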
All interfaces in context-bench are typing.Protocol classes: implement the methods on any class you choose — you never need to subclass a context-bench base class.
