TokenEfficiencyMetric computes accuracy per token as a primary optimization target for the autoresearch loop. It produces a single number to maximize when trading off quality versus cost.
from context_bench.metrics import TokenEfficiencyMetric

Constructor parameters

score_field (string, default: "f1")
Name of the score key to use as quality. Must match a key emitted by an evaluator.

Formula

token_efficiency = mean(scores[score_field]) * (100 / mean(input_tokens))^0.1
The score is F1-dominant: it softly penalizes token bloat without allowing the metric to be gamed by driving tokens to near-zero. A raw ratio (mean_score / (mean_input_tokens / 1000)) is also returned as token_efficiency_raw.
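
To see why the 0.1 exponent resists gaming, here is a plain-Python sketch of the adjusted formula (no context_bench dependency; the quality and token numbers are made up for illustration):

```python
# Plain-Python sketch of the adjusted formula above.
def token_efficiency(mean_score, mean_input_tokens):
    return mean_score * (100 / mean_input_tokens) ** 0.1

lean  = token_efficiency(0.60, 100)      # 0.60 quality at 100 tokens/row
bloat = token_efficiency(0.60, 10_000)   # same quality at 100x the tokens
tiny  = token_efficiency(0.60, 1)        # same quality at near-zero tokens

# A 100x token increase only multiplies the score by (1/100)**0.1 ≈ 0.63,
# and shrinking tokens to 1 boosts it by at most 100**0.1 ≈ 1.58, so the
# adjustment is bounded in both directions. The raw ratio, by contrast,
# changes by the full factor of 100 and rewards driving tokens to zero.
```

This is why token_efficiency, not token_efficiency_raw, is the optimization target: the raw ratio can be maximized by returning almost no context at all.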

Return value

token_efficiency (float)
Adjusted score per token (F1-dominant). The primary optimization target — higher is better. Maximize this when tuning a retrieval pipeline.

token_efficiency_raw (float)
Raw mean_score / (mean_input_tokens / 1000). Kept for analysis; use token_efficiency for optimization.

mean_score (float)
Mean value of score_field across all rows.

mean_input_tokens (float)
Mean input token count per row.

mean_ingest_latency (float)
Mean ingest latency in seconds (from row.metadata["ingest_latency"]). Relevant for memory system evaluations.

mean_query_latency (float)
Mean query latency in seconds (from row.metadata["query_latency"]). Relevant for memory system evaluations.
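
The return fields can be recomputed by hand. A minimal sketch, assuming each dataset row exposes the score key, an input token count, and the two documented metadata keys (the row shape and numbers here are illustrative, not the library's actual row class):

```python
from statistics import fmean

# Illustrative rows; only the metadata keys "ingest_latency" and
# "query_latency" are documented — the rest of the shape is assumed.
rows = [
    {"f1": 0.6, "input_tokens": 1000,
     "metadata": {"ingest_latency": 0.8, "query_latency": 0.1}},
    {"f1": 0.7, "input_tokens": 1400,
     "metadata": {"ingest_latency": 1.0, "query_latency": 0.2}},
]

summary = {
    "mean_score": fmean(r["f1"] for r in rows),
    "mean_input_tokens": fmean(r["input_tokens"] for r in rows),
    "mean_ingest_latency": fmean(r["metadata"]["ingest_latency"] for r in rows),
    "mean_query_latency": fmean(r["metadata"]["query_latency"] for r in rows),
}
# Adjusted (primary) and raw variants, per the formulas above.
summary["token_efficiency"] = (
    summary["mean_score"] * (100 / summary["mean_input_tokens"]) ** 0.1
)
summary["token_efficiency_raw"] = (
    summary["mean_score"] / (summary["mean_input_tokens"] / 1000)
)
```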

Usage

from context_bench import evaluate_memory
from context_bench.evaluators import AnswerQuality, MemoryJudge
from context_bench.metrics import TokenEfficiencyMetric

result = evaluate_memory(
    systems=[my_memory_system],
    dataset=locomo_data,
    evaluators=[AnswerQuality(), MemoryJudge(base_url="http://localhost:8080")],
    metrics=[TokenEfficiencyMetric(score_field="f1")],
)

summary = result.summary["my-memory-system"]
print(summary["token_efficiency"])     # e.g. 0.47   (higher = better)
print(summary["mean_score"])           # e.g. 0.61
print(summary["mean_input_tokens"])    # e.g. 1420.0

When it is enabled

TokenEfficiencyMetric is automatically used by the context-bench memory CLI subcommand. It is also used as the optimization target in the autoresearch loop.
Maximize token_efficiency when running the autoresearch loop to find pipelines that achieve high quality without spending excessive tokens on context retrieval.
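
As a toy illustration of what "optimization target" means here, picking the best of several candidate pipelines reduces to an argmax over token_efficiency (the pipeline names and numbers below are invented; in practice each entry would come from result.summary after a run):

```python
# Hypothetical per-pipeline summaries.
candidates = {
    "bm25-top5":    {"token_efficiency": 0.41, "mean_input_tokens": 900.0},
    "hybrid-top10": {"token_efficiency": 0.47, "mean_input_tokens": 1420.0},
    "full-context": {"token_efficiency": 0.39, "mean_input_tokens": 6200.0},
}

# Higher is better, so the winner is the argmax over token_efficiency.
best = max(candidates, key=lambda name: candidates[name]["token_efficiency"])
```

Note that "full-context" loses despite presumably high quality: its token cost drags the adjusted score down, which is exactly the trade-off the metric encodes.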
