context-bench accepts any Iterable[dict] as a dataset. You can load data from a local file, build it in memory, or register a named loader that behaves like a built-in dataset.

Required fields

Every example dict must have at minimum:
- id (string | int, required): Unique identifier for the example. Used for deduplication and result caching.
- context (string, required): The text that will be passed through the system under test. This is the input the system compresses, summarizes, or otherwise transforms.
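The required fields can be checked before running an evaluation. A minimal sketch (validate_examples is a hypothetical helper, not part of context-bench):

```python
def validate_examples(examples: list[dict]) -> list[dict]:
    """Check that every example has the required id and context fields."""
    seen_ids = set()
    for i, ex in enumerate(examples):
        for field in ("id", "context"):
            if field not in ex:
                raise ValueError(f"example {i} is missing required field {field!r}")
        if not isinstance(ex["id"], (str, int)):
            raise TypeError(f"example {i}: id must be a string or int")
        if not isinstance(ex["context"], str):
            raise TypeError(f"example {i}: context must be a string")
        if ex["id"] in seen_ids:  # ids must be unique for dedup and caching
            raise ValueError(f"duplicate id: {ex['id']!r}")
        seen_ids.add(ex["id"])
    return examples
```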

Optional evaluator fields

Evaluators look for additional fields to compute their scores. Include the fields that match the evaluators you use:
| Field | Used by | Purpose |
| --- | --- | --- |
| answer | AnswerQuality, MathEquivalence | Ground-truth answer string |
| question | AnswerQuality, LLMJudge | Question text sent to the system |
| correct_letter | MultipleChoiceAccuracy | Expected choice letter (A-J) |
| choices | MultipleChoiceAccuracy | List of answer choices |
| test | CodeExecution | Test code to verify the response |
| entry_point | CodeExecution | Function name to call in the test |
| instruction_id_list | IFEvalChecker | List of constraint IDs |
| kwargs | IFEvalChecker | Per-constraint keyword arguments |
| dataset | PerDatasetBreakdown | Dataset label for per-dataset slicing |
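For instance, examples aimed at MultipleChoiceAccuracy and CodeExecution might look like this (the field names come from the table above; all values are illustrative):

```python
# A multiple-choice example: correct_letter points into choices.
mc_example = {
    "id": "mc-1",
    "context": "Mercury is the closest planet to the Sun.",
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Earth", "Mars"],
    "correct_letter": "B",   # MultipleChoiceAccuracy
    "dataset": "astro_mc",   # PerDatasetBreakdown
}

# A code-generation example: test is executed with the generated code,
# and entry_point names the function the test calls.
code_example = {
    "id": "code-1",
    "context": "Write a function add(a, b) that returns the sum of a and b.",
    "test": "assert add(2, 3) == 5",  # CodeExecution
    "entry_point": "add",             # CodeExecution
}
```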

Local JSONL files

Pass a path ending in .jsonl directly to --dataset. Each line must be valid JSON.
context-bench --proxy http://localhost:7878 --dataset ./my_data.jsonl
Example JSONL file:
{"id": "q1", "context": "Alice went to Paris in 2023.", "question": "When did Alice go to Paris?", "answer": "2023"}
{"id": "q2", "context": "The Eiffel Tower is 330 meters tall.", "question": "How tall is the Eiffel Tower?", "answer": "330 meters"}
{"id": "q3", "context": "Python was created by Guido van Rossum.", "question": "Who created Python?", "answer": "Guido van Rossum"}
You can also load JSONL files from the Python API:
from context_bench.datasets.local import load_jsonl

dataset = load_jsonl("./my_data.jsonl", n=100)  # n limits the number of examples

Python API

Pass any list[dict] as the dataset argument to evaluate():
from context_bench import evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore

my_dataset = [
    {
        "id": "ex-1",
        "context": "The speed of light is approximately 299,792 km/s.",
        "question": "What is the speed of light?",
        "answer": "299,792 km/s",
    },
    {
        "id": "ex-2",
        "context": "Water boils at 100°C at sea level.",
        "question": "At what temperature does water boil?",
        "answer": "100°C",
    },
]

result = evaluate(
    systems=[my_system],
    dataset=my_dataset,
    evaluators=[AnswerQuality()],
    metrics=[MeanScore(score_field="f1")],
)
print(result.summary)

Registering a named dataset

To make your dataset available by name (like a built-in), register a loader function with the registry:
from context_bench.registry import registry

def load_my_dataset(n: int | None = None) -> list[dict]:
    """Load examples from your data source."""
    examples = [
        {"id": "1", "context": "...", "answer": "..."},
        # ...
    ]
    return examples[:n] if n is not None else examples

registry.register("dataset", "my_dataset", load_my_dataset)
After registration, load it programmatically using the registry helpers:
from context_bench.registry import load_dataset

examples = load_dataset("my_dataset", n=50)
Tag each example with example["dataset"] = "my_dataset" before passing to evaluate(). This enables PerDatasetBreakdown to slice scores correctly when you mix your dataset with built-in ones.
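The tagging step can be sketched with a small helper (tag_dataset is a hypothetical function, not part of context-bench):

```python
def tag_dataset(examples: list[dict], label: str) -> list[dict]:
    """Return copies of the examples with a dataset label attached."""
    return [{**ex, "dataset": label} for ex in examples]

# Label your examples before mixing them with built-in datasets, so
# PerDatasetBreakdown can slice scores per source.
my_examples = tag_dataset(
    [
        {"id": "1", "context": "First document..."},
        {"id": "2", "context": "Second document..."},
    ],
    "my_dataset",
)
```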

Minimal example

A single-file custom dataset that scores an extractive QA system:
from context_bench import evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore, PassRate

class Truncator:
    """Trivial system that keeps only the first 200 characters."""
    name = "truncator"

    def process(self, example: dict) -> dict:
        return {**example, "context": example["context"][:200]}

dataset = [
    {"id": "1", "context": "Jupiter is the largest planet in the solar system, with a mass more than twice that of all other planets combined.", "question": "What is the largest planet?", "answer": "Jupiter"},
    {"id": "2", "context": "The Amazon River is the largest river by discharge volume of water in the world.", "question": "Which river has the largest discharge?", "answer": "Amazon River"},
]

result = evaluate(
    systems=[Truncator()],
    dataset=dataset,
    evaluators=[AnswerQuality()],
    metrics=[MeanScore(score_field="f1"), PassRate(threshold=0.5, score_field="f1")],
)
print(result.summary)
