The System protocol defines what you benchmark. Implement process() to wrap any LLM context transformation.
A system is anything that transforms LLM context. Prompt compressors, memory managers, RAG rerankers, proxy servers — if it modifies what goes into a context window, it is a system.
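A minimal system is just a class with a `name` attribute and a `process()` method; no inheritance is required. The sketch below is illustrative, assuming `process()` receives an example dict with a `"context"` field and returns it with a `"response"` added. The `call_llm` helper is a hypothetical stand-in for your real model call:

```python
def call_llm(context: str, question: str) -> str:
    """Hypothetical stub for a real model call; replace with your API."""
    return f"(answer based on {len(context)} chars of context)"


class TruncatingSystem:
    """Toy system: truncate the context to a budget before the model call."""

    name = "truncate-4k"

    def process(self, example: dict) -> dict:
        # Transform the context (here: naive truncation to 4000 chars),
        # then return the example with a "response" field added.
        context = example["context"][:4000]
        return {**example, "response": call_llm(context, example.get("question", ""))}
```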
OpenAIProxy
Forwards requests through any OpenAI-compatible proxy endpoint. Used by the CLI's --proxy flag.
ClaudeCLI
Calls the Claude CLI directly. Useful for testing Claude without an API key or proxy.
```python
from context_bench import OpenAIProxy

system = OpenAIProxy(
    base_url="http://localhost:7878",
    model="gpt-4",
    api_key="sk-...",               # or set OPENAI_API_KEY env var
    system_prompt="Be concise.",    # prepended as system message
    extra_body={"temperature": 0},  # any additional request params
    name="kompact",
)
```
For multi-turn datasets (such as MT-Bench), the runner checks whether the system has a process_conversation() method. If present, it is called with the list of conversation turns instead of process().
```python
class MyMultiTurnSystem:
    name = "my-chatbot"

    def process(self, example):
        # Fallback for single-turn examples
        return {**example, "response": call_my_api(example["context"])}

    def process_conversation(self, turns):
        """Called for multi_turn examples (e.g., MT-Bench).

        Args:
            turns: list of {"role": "user"|"assistant", "content": str} dicts

        Returns:
            list of {"role": "assistant", "content": str} dicts
        """
        history = []
        responses = []
        for turn in turns:
            history.append(turn)
            reply = call_my_api_with_history(history)
            history.append({"role": "assistant", "content": reply})
            responses.append({"role": "assistant", "content": reply})
        return responses
```
The runner takes the final assistant response from process_conversation() and stores it as "response" in the EvalRow.
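That behavior can be sketched as follows (an illustration, not the runner's actual code; the `store_final_response` helper and the `row` shape are assumptions):

```python
def store_final_response(row: dict, responses: list[dict]) -> dict:
    """Sketch: keep only the last assistant reply from
    process_conversation() as the row's "response"."""
    return {**row, "response": responses[-1]["content"]}
```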
For stateful memory evaluation (conversation recall, long-term memory), implement the MemorySystem protocol instead:
```python
@runtime_checkable
class MemorySystem(Protocol):
    """A stateful memory system evaluated over conversation histories."""

    @property
    def name(self) -> str: ...

    def reset(self) -> None:
        """Clear all stored memory. Called once before each conversation."""
        ...

    def ingest(self, turns: list[dict[str, Any]]) -> None:
        """Store a conversation history.

        Args:
            turns: List of {"role": "user"|"assistant", "content": str}
                dicts in chronological order.
        """
        ...

    def query(self, question: str) -> str:
        """Answer a question using stored memory.

        Returns:
            The system's answer string.
        """
        ...
```
The runner calls reset() between conversations, ingest() to load the conversation history, then query() for each QA pair.
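A minimal in-memory implementation, driven through that same reset/ingest/query sequence, might look like this. The keyword-overlap retrieval is purely illustrative, a sketch of the protocol shape rather than a useful memory system:

```python
class NaiveMemory:
    """Toy MemorySystem: stores raw turns, answers by keyword overlap."""

    name = "naive-memory"

    def __init__(self) -> None:
        self.turns: list[dict] = []

    def reset(self) -> None:
        # Called once before each conversation.
        self.turns = []

    def ingest(self, turns: list[dict]) -> None:
        # Store the history in chronological order.
        self.turns.extend(turns)

    def query(self, question: str) -> str:
        # Return the stored turn sharing the most words with the question.
        q = set(question.lower().split())
        best = max(
            self.turns,
            key=lambda t: len(q & set(t["content"].lower().split())),
            default=None,
        )
        return best["content"] if best else ""


# The call order the runner follows for each conversation:
memory = NaiveMemory()
memory.reset()
memory.ingest([
    {"role": "user", "content": "My cat is named Miso."},
    {"role": "assistant", "content": "Nice to meet Miso!"},
])
answer = memory.query("What is my cat named?")
```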
All interfaces in context-bench are typing.Protocol classes. Implement the required methods on any class you like; there is no context-bench base class to subclass.
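Because the protocols are declared @runtime_checkable, a plain class with the right members passes an isinstance() check through structural typing alone. A self-contained sketch, using an assumed minimal protocol shape rather than the library's actual definition:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class System(Protocol):
    # Assumed minimal shape for illustration; see the real protocol
    # in context_bench for the authoritative definition.
    name: str

    def process(self, example: dict[str, Any]) -> dict[str, Any]: ...


class Identity:
    """No inheritance needed: satisfying the shape is enough."""

    name = "identity"

    def process(self, example: dict[str, Any]) -> dict[str, Any]:
        return {**example, "response": example.get("context", "")}


assert isinstance(Identity(), System)  # structural check, no subclassing
```

Note that isinstance() against a runtime-checkable Protocol only verifies that the members exist; it does not check signatures or types.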