EvalResult is a dataclass returned by evaluate() and evaluate_memory(). It contains every scored row, aggregated summary statistics, timing data, and the run configuration.
EvalResult
Fields
All scored rows from the evaluation run. Each row corresponds to one (system, example) pair. For memory evaluations, each row corresponds to one (system, example, query) triplet.
Per-system metric summaries. The outer key is the system name; the inner dict maps metric names to their computed values. If no metrics were passed to evaluate(), this dict contains an empty dict per system.

Wall-clock time in seconds spent processing each system. Key is the system name.
Snapshot of the run configuration for reproducibility:
"systems"— list of system names"evaluators"— list of evaluator names"metrics"— list of metric names"num_examples"— number of examples evaluated"max_workers"— thread count used (1 for sequential)
Methods
filter()
Returns a new EvalResult containing only the rows whose EvalRow attribute values match the given keyword arguments. The summary is recomputed to include only the systems present in the filtered rows.
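The row-matching behavior described above can be sketched in isolation (the filter_rows helper and the system_name and dataset attribute names are hypothetical; the real method lives on EvalResult):

```python
from dataclasses import dataclass

# Hypothetical stand-in for EvalRow with two illustrative attributes.
@dataclass
class EvalRow:
    system_name: str
    dataset: str

rows = [
    EvalRow("baseline", "hotpot"),
    EvalRow("baseline", "nq"),
    EvalRow("rag", "hotpot"),
]

def filter_rows(rows, **kwargs):
    # Keep only rows whose attributes match every keyword argument.
    return [r for r in rows
            if all(getattr(r, k) == v for k, v in kwargs.items())]

subset = filter_rows(rows, system_name="baseline", dataset="hotpot")
print(len(subset))  # 1
```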
to_json()
to_dataframe()
Returns a DataFrame with one record per row. Score keys from EvalRow.scores and metadata keys from EvalRow.metadata are flattened into top-level columns.
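The flattening step can be sketched without pandas: each row's scalar fields are merged with its scores and metadata dicts into one flat record (the flatten_row helper and the dict-based row shape are illustrative assumptions):

```python
def flatten_row(row: dict) -> dict:
    # Promote nested score and metadata keys to top-level columns,
    # dropping the containing "scores" and "metadata" dicts.
    flat = {k: v for k, v in row.items() if k not in ("scores", "metadata")}
    flat.update(row.get("scores", {}))
    flat.update(row.get("metadata", {}))
    return flat

record = flatten_row({
    "system_name": "baseline",
    "scores": {"f1": 0.8, "exact_match": 0.0},
    "metadata": {"qa_type": "multi_hop"},
})
print(sorted(record))
```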
EvalRow
EvalRow is a flat dataclass. One row is created per (system, example) pair by evaluate(), or per (system, example, query) triplet by evaluate_memory().
Fields
The name of the system that produced this row. Matches the .name property of the system object.

The identifier of the example, taken from example["id"]. Falls back to the integer index within the dataset if the example has no "id" key.

All scores produced by evaluators for this row. Keys are evaluator-specific score names, e.g. {"f1": 0.8, "exact_match": 0.0, "recall": 1.0, "contains": 1.0}.

Token count of the input example dict, measured before the system processes it.
Token count of the processed dict returned by the system, measured after .process() returns.

Additional data captured during processing. When the system returns "api_usage" in its output dict, the following keys are populated automatically:

prompt_tokens — reported by the proxy API
completion_tokens — reported by the proxy API
total_tokens — reported by the proxy API

metadata also contains qa_type, ingest_latency, query_latency, conversation_id, and turn_count.

Wall-clock time in seconds for the system's .process() call, measured with time.monotonic(). Does not include evaluator scoring time.

Dataset tag for the example, taken from example["dataset"]. Empty string if the example has no "dataset" key.