
evaluate()

The evaluate() function is the main entry point for running evaluations.

Import

import { evaluate } from '@deepagents/evals';

Signature

function evaluate<T>(
  options: EvaluateOptions<T>
): EvalBuilder<RunSummary>;

function evaluate<T, V extends { name: string }>(
  options: EvaluateEachOptions<T, V>
): EvalBuilder<RunSummary[]>;

Options

EvaluateOptions<T>

Evaluate a single model:
interface EvaluateOptions<T> {
  name: string;
  model: string;
  dataset: AsyncIterable<T>;
  task: TaskFn<T>;
  scorers: Record<string, Scorer>;
  reporters: Reporter[];
  store: RunStore;
  suiteId?: string;
  maxConcurrency?: number;
  timeout?: number;
  trials?: number;
  threshold?: number;
}

name

Human-readable name for the evaluation run.
name: 'my-eval'

model

Model identifier (passed to reporters and stored in the database).
model: 'gpt-4o'

dataset

Dataset of input/expected pairs. See Datasets.
import { dataset } from '@deepagents/evals/dataset';

dataset: dataset([
  { input: 'What is 2+2?', expected: '4' },
])

task

Function that calls your model and returns the output.
task: async (item) => {
  const response = await callMyLLM(item.input);
  return {
    output: response,
    usage: { inputTokens: 10, outputTokens: 5 },
  };
}
Type:
type TaskFn<T> = (input: T) => Promise<TaskResult>;

interface TaskResult {
  output: string;
  usage?: { inputTokens: number; outputTokens: number };
}

scorers

Named scoring functions. See Scorers.
import { exactMatch, includes } from '@deepagents/evals/scorers';

scorers: {
  exact: exactMatch,
  contains: includes,
}

reporters

Reporters that receive lifecycle events and produce output.
import { consoleReporter } from '@deepagents/evals/reporters';

reporters: [consoleReporter({ verbosity: 'normal' })]

store

Persistent store for run history.
import { RunStore } from '@deepagents/evals/store';

store: new RunStore('.evals/store.db')

suiteId (optional)

Associate this run with an existing suite ID.
const suite = store.createSuite('text2sql-accuracy');
suiteId: suite.id
If omitted, a new suite is created automatically, named after the name option.
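For example, to group several runs under one suite so their history accumulates together (a sketch; baseOptions is a hypothetical object holding the shared name, dataset, task, scorers, and reporters):
const store = new RunStore('.evals/store.db');
const suite = store.createSuite('text2sql-accuracy');

// Both runs are recorded under the same suite in the store.
// baseOptions (hypothetical) carries the remaining required options.
await evaluate({ ...baseOptions, model: 'gpt-4o', store, suiteId: suite.id });
await evaluate({ ...baseOptions, model: 'gpt-4o-mini', store, suiteId: suite.id });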

maxConcurrency (optional)

Maximum number of cases to run concurrently.
maxConcurrency: 10  // Default: 10

timeout (optional)

Per-case timeout in milliseconds.
timeout: 30_000  // Default: 30000 (30 seconds)

trials (optional)

Number of times to run each case and average the scores.
trials: 3  // Run each case 3 times

threshold (optional)

Minimum average score (0–1) required for a case to pass.
threshold: 0.5  // Default: 0.5
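For example, assuming the average is taken across all of a case's scorer results (and across trials, when set):
// Hypothetical case scored by two scorers:
//   exact:    1.0
//   contains: 0.0
// average = (1.0 + 0.0) / 2 = 0.5, which passes at the default threshold of 0.5
threshold: 0.75  // stricter: the same case would now fail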

EvaluateEachOptions<T, V>

Evaluate multiple model variants:
interface EvaluateEachOptions<T, V extends { name: string }> {
  name: string;
  models: V[];
  dataset: AsyncIterable<T>;
  task: (input: T, variant: V) => Promise<TaskResult>;
  scorers: Record<string, Scorer>;
  reporters: Reporter[];
  store: RunStore;
  maxConcurrency?: number;
  timeout?: number;
  trials?: number;
  threshold?: number;
}

models

Array of model variants. Each variant must have a name property:
models: [
  { name: 'gpt-4o', temperature: 0.7 },
  { name: 'gpt-4o-mini', temperature: 0.7 },
]

task

The task function receives both the input and the current variant:
task: async (input, variant) => {
  const response = await callMyLLM(input.input, variant);
  return { output: response };
}
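Because V only has to carry a name, you can define a richer variant type and receive it, fully typed, in the task. A sketch (MyVariant is illustrative, not part of the library):
interface MyVariant {
  name: string;        // required by the V extends { name: string } constraint
  temperature: number; // any extra per-variant configuration you need
}

const variants: MyVariant[] = [
  { name: 'gpt-4o', temperature: 0.7 },
  { name: 'gpt-4o-mini', temperature: 0.2 },
];

// Then, inside evaluate({ ... }):
//   models: variants,
//   task: async (input, variant) => { ... }  // variant is typed as MyVariant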

Return Value

The evaluate() function returns an EvalBuilder that implements PromiseLike, so you can await it directly. The awaited value is a RunSummary, or a RunSummary[] when evaluating multiple variants with models:
const summary = await evaluate(options);
Or use the builder methods:

failed()

Run only cases that failed in the previous run:
await evaluate(options).failed();
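A typical loop, assuming the previous run's failures are looked up from the configured store:
// First pass: a full run, with per-case results recorded in the store
await evaluate(options);

// After adjusting the task or prompt, re-run only the failures
await evaluate(options).failed();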

cases(spec)

Run specific cases by index:
await evaluate(options).cases('0-10,15,20-25');
Supported formats:
  • 0-10 — Range from 0 to 10 (inclusive)
  • 5 — Single index
  • 0-10,15,20-25 — Multiple ranges and indexes

sample(n)

Run a random sample of n cases:
await evaluate(options).sample(50);

assert()

Throws EvalAssertionError if any case fails:
import { EvalAssertionError } from '@deepagents/evals'; // import path assumed from the main entry point

try {
  await evaluate(options).assert();
} catch (err) {
  if (err instanceof EvalAssertionError) {
    console.error('Eval failed:', err.summary);
  }
}
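In CI you can also let the rejection propagate, or exit explicitly. A Node-specific sketch (process handling is up to your runner):
// Fail the CI job when any case falls below the threshold
evaluate(options).assert().then(
  () => console.log('eval passed'),
  (err) => {
    console.error(err);
    process.exit(1);
  },
);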

Example: Single Model

import { evaluate, dataset, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';

const summary = await evaluate({
  name: 'my-eval',
  model: 'gpt-4o',
  dataset: dataset([
    { input: 'What is 2+2?', expected: '4' },
  ]),
  task: async (item) => {
    const response = await callMyLLM(item.input);
    return { output: response };
  },
  scorers: { exact: exactMatch },
  reporters: [consoleReporter()],
  store: new RunStore(),
});

console.log(summary);

Example: Multiple Models

import { evaluate, dataset, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';

const summaries = await evaluate({
  name: 'model-comparison',
  models: [
    { name: 'gpt-4o' },
    { name: 'gpt-4o-mini' },
  ],
  dataset: dataset([
    { input: 'What is 2+2?', expected: '4' },
  ]),
  task: async (item, variant) => {
    const response = await callMyLLM(item.input, variant.name);
    return { output: response };
  },
  scorers: { exact: exactMatch },
  reporters: [consoleReporter()],
  store: new RunStore(),
});

for (const summary of summaries) {
  console.log(summary);
}

Example: Builder Pattern

// Run only failed cases
await evaluate(options).failed();

// Run specific cases
await evaluate(options).cases('0-10,15');

// Run random sample
await evaluate(options).sample(50);

// Assert no failures
await evaluate(options).assert();

// Chain methods
await evaluate(options).cases('0-10').assert();

Types

RunSummary

interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}
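For example, to derive a pass rate from the fields above:
const summary = await evaluate(options);

const passRate = summary.passCount / summary.totalCases;
console.log(`passed ${summary.passCount}/${summary.totalCases} (${(passRate * 100).toFixed(1)}%)`);
console.log('mean scores:', summary.meanScores);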

EvalAssertionError

class EvalAssertionError extends Error {
  summary: RunSummary | RunSummary[];
}
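Since summary is a union (a single run, or one run per variant when using models), narrow it before reading run fields:
try {
  await evaluate(options).assert();
} catch (err) {
  if (err instanceof EvalAssertionError) {
    // summary may be a single RunSummary or an array of them
    const runs = Array.isArray(err.summary) ? err.summary : [err.summary];
    for (const run of runs) {
      console.error(`${run.failCount}/${run.totalCases} cases failed`);
    }
  }
}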

Next Steps

• Datasets: Learn about dataset loading
• Scorers: Explore scorer functions
• Engine API: Lower-level engine API
