Engine API

The engine module orchestrates dataset iteration, task execution, scoring, and persistence.

Import

import { EvalEmitter, runEval } from '@deepagents/evals/engine';

runEval(config)

Lower-level function that runs an evaluation with full control over dataset iteration, concurrency, events, and persistence. For a simpler interface, see the higher-level evaluate() function under Next Steps.

Signature

function runEval<T>(config: EvalConfig<T>): Promise<RunSummary>;

Parameters

interface EvalConfig<T> {
  name: string;
  model: string;
  dataset: AsyncIterable<T>;
  task: TaskFn<T>;
  scorers: Record<string, Scorer>;
  store: RunStore;
  emitter?: EvalEmitter;
  suiteId?: string;
  config?: Record<string, unknown>;
  maxConcurrency?: number;
  batchSize?: number;
  timeout?: number;
  trials?: number;
  threshold?: number;
}

name

Human-readable name for the evaluation run.
name: 'my-eval'

model

Model identifier.
model: 'gpt-4o'

dataset

Dataset to evaluate. Any AsyncIterable<T> is accepted; the dataset() helper wraps a plain array.
import { dataset } from '@deepagents/evals/dataset';

dataset: dataset([{ input: 'What is 2+2?', expected: '4' }])
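
Since the field is typed AsyncIterable<T>, an async generator can also be passed directly. A minimal sketch (the dataset() helper may add conveniences on top):

// Stream cases one at a time instead of materializing an array up front.
async function* mathCases() {
  yield { input: 'What is 2+2?', expected: '4' };
  yield { input: 'What is 3*3?', expected: '9' };
}

dataset: mathCases()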

task

Task function that calls your model.
task: async (item) => {
  const response = await callMyLLM(item.input);
  return { output: response };
}
Type:
type TaskFn<T> = (input: T) => Promise<TaskResult>;

interface TaskResult {
  output: string;
  usage?: { inputTokens: number; outputTokens: number };
}
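
If your client reports token counts, return them via usage so the run can accumulate totalTokensIn and totalTokensOut. A sketch, assuming a hypothetical client whose response includes those counts:

task: async (item) => {
  // Hypothetical response shape: { text, promptTokens, completionTokens }.
  const res = await callMyLLM(item.input);
  return {
    output: res.text,
    usage: { inputTokens: res.promptTokens, outputTokens: res.completionTokens },
  };
}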

scorers

Named scoring functions.
import { exactMatch } from '@deepagents/evals/scorers';

scorers: { exact: exactMatch }

store

Run store for persistence.
import { RunStore } from '@deepagents/evals/store';

store: new RunStore('.evals/store.db')

emitter (optional)

Event emitter for lifecycle events.
const emitter = new EvalEmitter();
emitter.on('case:scored', (data) => console.log(data));

emitter: emitter
If omitted, a default emitter is created internally; events still fire, but nothing observes them.

suiteId (optional)

ID of an existing suite to associate this run with.
suiteId: suite.id

config (optional)

Arbitrary configuration metadata to store with the run.
config: { temperature: 0.7, top_p: 1.0 }

maxConcurrency (optional)

Maximum number of cases executed concurrently.
maxConcurrency: 10  // Default: 10

batchSize (optional)

Process cases in batches, waiting for each batch to complete before starting the next.
batchSize: 50  // Process 50 cases at a time
If omitted, all cases are processed in one batch (subject to maxConcurrency).

timeout (optional)

Per-case timeout in milliseconds.
timeout: 30_000  // Default: 30000

trials (optional)

Run each case multiple times and average the scores.
trials: 3
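For example, with trials: 3 and per-trial exact-match scores of 1, 0, and 1, the recorded score would be 2/3 ≈ 0.67 (assuming a plain arithmetic mean), which clears the default 0.5 threshold.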

threshold (optional)

Minimum score for a case to count as a pass.
threshold: 0.5  // Default: 0.5
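For instance, with threshold: 0.7, a case scoring 0.65 counts toward failCount rather than passCount.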

Return Value

Returns a RunSummary:
interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}
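
The fields are raw counts and sums, so derived metrics are a division away. A short sketch (assuming config is an EvalConfig like the one above):

const summary = await runEval(config);

const passRate = summary.passCount / summary.totalCases;
const meanLatencyMs = summary.totalLatencyMs / summary.totalCases;

console.log(`pass rate: ${(passRate * 100).toFixed(1)}%`);
console.log(`mean latency: ${meanLatencyMs.toFixed(0)}ms`);
console.log('mean scores:', summary.meanScores);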

EvalEmitter

Event emitter for evaluation lifecycle events.

Constructor

const emitter = new EvalEmitter();

Events

run:start

Emitted when the run begins.
emitter.on('run:start', (data) => {
  console.log(`Starting run ${data.runId} with ${data.totalCases} cases`);
});
Payload:
{
  runId: string;
  totalCases: number;
  name: string;
  model: string;
}

case:start

Emitted when a case starts executing.
emitter.on('case:start', (data) => {
  console.log(`Case #${data.index} started`);
});
Payload:
{
  runId: string;
  index: number;
  input: unknown;
}

case:scored

Emitted when a case is scored (always fires, even on error).
emitter.on('case:scored', (data) => {
  console.log(`Case #${data.index} scored:`, data.scores);
});
Payload:
{
  runId: string;
  index: number;
  input: unknown;
  output: string;
  expected: unknown;
  scores: Record<string, ScorerResult>;
  error?: unknown;
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
}

case:error

Emitted when a case throws an error.
emitter.on('case:error', (data) => {
  console.error(`Case #${data.index} failed:`, data.error);
});
Payload:
{
  runId: string;
  index: number;
  error: string;
}

run:end

Emitted when the run completes.
emitter.on('run:end', (data) => {
  console.log('Run completed:', data.summary);
});
Payload:
{
  runId: string;
  summary: RunSummary;
}

Event Type

All events are typed:
interface EngineEvents {
  'run:start': { runId: string; totalCases: number; name: string; model: string };
  'case:start': { runId: string; index: number; input: unknown };
  'case:scored': {
    runId: string;
    index: number;
    input: unknown;
    output: string;
    expected: unknown;
    scores: Record<string, ScorerResult>;
    error?: unknown;
    latencyMs: number;
    tokensIn: number;
    tokensOut: number;
  };
  'case:error': { runId: string; index: number; error: string };
  'run:end': { runId: string; summary: RunSummary };
}
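
Because payload types are keyed by event name, a handler's data parameter is inferred from the name passed to on. Conceptually the method looks like this (a sketch, not the library's exact declaration):

interface EvalEmitterLike {
  on<K extends keyof EngineEvents>(
    event: K,
    handler: (data: EngineEvents[K]) => void,
  ): void;
}

// data is inferred as EngineEvents['run:end']:
emitter.on('run:end', (data) => console.log(data.summary.meanScores));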

Examples

Basic Usage

import { runEval, EvalEmitter } from '@deepagents/evals/engine';
import { dataset } from '@deepagents/evals/dataset';
import { exactMatch } from '@deepagents/evals/scorers';
import { RunStore } from '@deepagents/evals/store';

const emitter = new EvalEmitter();
emitter.on('case:scored', (data) => {
  console.log(`Case #${data.index}: ${JSON.stringify(data.scores)}`);
});

const summary = await runEval({
  name: 'my-eval',
  model: 'gpt-4o',
  dataset: dataset([{ input: 'What is 2+2?', expected: '4' }]),
  task: async (item) => {
    const response = await callMyLLM(item.input);
    return { output: response };
  },
  scorers: { exact: exactMatch },
  store: new RunStore(),
  emitter,
  maxConcurrency: 5,
  timeout: 30_000,
});

console.log(summary);

Event Listeners

const emitter = new EvalEmitter();

emitter.on('run:start', (data) => {
  console.log(`Starting ${data.name} with ${data.totalCases} cases`);
});

emitter.on('case:scored', (data) => {
  const passed = Object.values(data.scores).every((s) => s.score >= 0.5);
  console.log(`Case #${data.index}: ${passed ? 'PASS' : 'FAIL'}`);
});

emitter.on('case:error', (data) => {
  console.error(`Case #${data.index} error: ${data.error}`);
});

emitter.on('run:end', (data) => {
  console.log(`Completed with ${data.summary.passCount}/${data.summary.totalCases} passed`);
});

await runEval({ emitter, /* ... */ });

Concurrency Control

await runEval({
  // ...
  maxConcurrency: 5,  // Run 5 cases in parallel
});

Batch Processing

await runEval({
  // ...
  batchSize: 50,      // Process 50 cases, then wait before starting next 50
  maxConcurrency: 10, // Within each batch, run 10 in parallel
});

Trials

await runEval({
  // ...
  trials: 3,  // Run each case 3 times, average the scores
});

Next Steps

Evaluate API

Higher-level evaluate() function

Reporters

Use reporters instead of raw events
