from context_bench import EvalResult, EvalRow
EvalResult is a dataclass returned by evaluate() and evaluate_memory(). It contains every scored row, aggregated summary statistics, timing data, and the run configuration.

EvalResult

@dataclass
class EvalResult:
    rows: list[EvalRow]
    summary: dict[str, dict[str, float]]
    timing: dict[str, float]
    config: dict[str, Any]

Fields

rows
list[EvalRow]
required
All scored rows from the evaluation run. Each row corresponds to one (system, example) pair. For memory evaluations, each row corresponds to one (system, example, query) triplet.
summary
dict[str, dict[str, float]]
required
Per-system metric summaries. The outer key is the system name; the inner dict maps metric names to their computed values.
# Structure: {system_name: {metric_name: value}}
result.summary["kompact"]["mean_score"]       # e.g. 0.734
result.summary["kompact"]["pass_rate"]        # e.g. 0.612
result.summary["kompact"]["compression_ratio"] # e.g. 0.45
If no metrics were passed to evaluate(), the inner dict for each system is empty.
timing
dict[str, float]
Wall-clock time in seconds spent processing each system. Key is the system name.
result.timing["kompact"]  # e.g. 42.3 (seconds)
config
dict[str, Any]
Snapshot of the run configuration for reproducibility:
  • "systems" — list of system names
  • "evaluators" — list of evaluator names
  • "metrics" — list of metric names
  • "num_examples" — number of examples evaluated
  • "max_workers" — thread count used (1 for sequential)
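Because config is a plain dict of JSON-friendly values, it can be persisted next to the scores and compared across runs. A minimal sketch, using the keys listed above with made-up values:

```python
import json

# Illustrative config snapshot with the keys listed above (values are invented).
config = {
    "systems": ["kompact", "headroom"],
    "evaluators": ["AnswerQuality"],
    "metrics": ["mean_score", "pass_rate"],
    "num_examples": 200,
    "max_workers": 8,
}

# Serialize alongside the results so the run can be reproduced later.
snapshot = json.dumps(config, indent=2, sort_keys=True)
print(snapshot)
```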

Methods

filter()

def filter(**kwargs: Any) -> EvalResult:
Return a new EvalResult containing only the rows where EvalRow attribute values match the given keyword arguments. The summary is recomputed to include only the systems present in the filtered rows.
# Filter to one system
headroom_result = result.filter(system="headroom")

# Filter to one dataset
hotpot_result = result.filter(dataset="hotpotqa")

to_json()

def to_json() -> str:
Serialize the full result to a JSON string. Includes all rows, summary, timing, and config.
json_str = result.to_json()
with open("results.json", "w") as f:
    f.write(json_str)

to_dataframe()

def to_dataframe() -> pandas.DataFrame:
Convert all rows to a pandas DataFrame. Each row becomes one record. Score keys from EvalRow.scores and metadata keys from EvalRow.metadata are flattened into top-level columns.
to_dataframe() requires pandas to be installed. Install it with pip install pandas.
df = result.to_dataframe()
print(df.columns.tolist())
# ['system', 'example_id', 'dataset', 'input_tokens', 'output_tokens',
#  'latency', 'f1', 'exact_match', 'recall', 'contains', ...]

# Filter and aggregate with pandas
df[df["system"] == "kompact"]["f1"].mean()

EvalRow

EvalRow is a flat dataclass. One row is created per (system, example) pair by evaluate(), or per (system, example, query) triplet by evaluate_memory().
@dataclass
class EvalRow:
    system: str
    example_id: str | int
    scores: dict[str, float]
    input_tokens: int
    output_tokens: int
    metadata: dict[str, Any]
    latency: float
    dataset: str

Fields

system
str
required
The name of the system that produced this row. Matches the .name property of the system object.
example_id
str | int
required
The identifier of the example, taken from example["id"]. Falls back to the integer index within the dataset if the example has no "id" key.
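The fallback can be sketched in one line — an assumed reconstruction of the logic described above, not the library's actual code:

```python
# Sketch of the id fallback: use example["id"] when present,
# otherwise the integer index within the dataset.
examples = [{"id": "q-001", "question": "..."}, {"question": "..."}]

ids = [ex.get("id", i) for i, ex in enumerate(examples)]
print(ids)  # ['q-001', 1]
```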
scores
dict[str, float]
required
All scores produced by evaluators for this row. Keys are evaluator-specific score names, e.g. {"f1": 0.8, "exact_match": 0.0, "recall": 1.0, "contains": 1.0}.
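Because every row carries the full scores dict, per-row pass/fail checks are a one-liner over rows. A sketch using plain dicts in place of EvalRow objects, with the score names from the example above and an arbitrary 0.5 threshold:

```python
# Each dict stands in for one EvalRow.scores mapping.
rows_scores = [
    {"f1": 0.8, "exact_match": 0.0, "recall": 1.0, "contains": 1.0},
    {"f1": 0.3, "exact_match": 0.0, "recall": 0.5, "contains": 0.0},
    {"f1": 1.0, "exact_match": 1.0, "recall": 1.0, "contains": 1.0},
]

# Fraction of rows whose f1 clears the threshold (0.5 here is arbitrary).
pass_rate = sum(s["f1"] >= 0.5 for s in rows_scores) / len(rows_scores)
print(pass_rate)
```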
input_tokens
int
required
Token count of the input example dict, measured before the system processes it.
output_tokens
int
required
Token count of the processed dict returned by the system, measured after .process() returns.
metadata
dict[str, Any]
Additional data captured during processing. When the system returns "api_usage" in its output dict, the following keys are populated automatically:
  • prompt_tokens — reported by the proxy API
  • completion_tokens — reported by the proxy API
  • total_tokens — reported by the proxy API
For memory evaluations, metadata also contains qa_type, ingest_latency, query_latency, conversation_id, and turn_count.
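Since these keys are only populated when the system reports "api_usage" (or, for the latency keys, only in memory evaluations), reading metadata defensively with .get() avoids KeyErrors. A sketch on a plain dict standing in for EvalRow.metadata:

```python
# Stand-in for EvalRow.metadata from a system that reported api_usage.
metadata = {
    "prompt_tokens": 812,
    "completion_tokens": 64,
    "total_tokens": 876,
}

# .get() returns a default instead of raising when a key is absent,
# e.g. for rows produced outside a memory evaluation.
total = metadata.get("total_tokens", 0)
ingest = metadata.get("ingest_latency")  # None: not a memory evaluation
print(total, ingest)
```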
latency
float
Wall-clock time in seconds for the system’s .process() call, measured with time.monotonic(). Does not include evaluator scoring time.
dataset
str
Dataset tag for the example, taken from example["dataset"]. Empty string if the example has no "dataset" key.

Examples

Accessing the summary

from context_bench import OpenAIProxy, evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore, PassRate, CompressionRatio

result = evaluate(
    systems=[OpenAIProxy("http://localhost:7878", name="my-system")],
    dataset=your_dataset,
    evaluators=[AnswerQuality()],
    metrics=[MeanScore(score_field="f1"), PassRate(score_field="f1"), CompressionRatio()],
)

print(result.summary["my-system"]["mean_score"])       # 0.734
print(result.summary["my-system"]["pass_rate"])        # 0.612
print(result.summary["my-system"]["compression_ratio"]) # 0.45

Iterating over rows

for row in result.rows:
    print(row.system, row.example_id, row.scores["f1"], row.latency)
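Per-system aggregates can also be computed over rows with the standard library alone. A sketch using a local dataclass that mirrors the two EvalRow fields it needs (it is not the library class):

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Local stand-in mirroring the EvalRow fields used here (not the library class).
@dataclass
class Row:
    system: str
    scores: dict = field(default_factory=dict)

rows = [
    Row("kompact", {"f1": 0.8}),
    Row("kompact", {"f1": 0.6}),
    Row("headroom", {"f1": 0.9}),
]

# Group f1 scores by system, then average each group.
grouped = defaultdict(list)
for row in rows:
    grouped[row.system].append(row.scores["f1"])
means = {system: sum(v) / len(v) for system, v in grouped.items()}
print(means)
```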

Exporting to JSON

json_str = result.to_json()
with open("results.json", "w") as f:
    f.write(json_str)

Exporting to a DataFrame

df = result.to_dataframe()

# Per-system mean F1
df.groupby("system")["f1"].mean()

# Slowest examples
df.sort_values("latency", ascending=False).head(10)

Filtering to one system

headroom_result = result.filter(system="headroom")
print(len(headroom_result.rows))           # rows for headroom only
print(headroom_result.summary["headroom"]) # summary recomputed

Filtering to one dataset

hotpot_result = result.filter(dataset="hotpotqa")
