
evaluate()

The evaluate() function is the main entry point for running evaluations.

Import

import { evaluate } from '@deepagents/evals';

Signature

function evaluate<T>(
  options: EvaluateOptions<T>
): EvalBuilder<RunSummary>;

function evaluate<T, V extends { name: string }>(
  options: EvaluateEachOptions<T, V>
): EvalBuilder<RunSummary[]>;

Options

EvaluateOptions<T>

Evaluate a single model:
interface EvaluateOptions<T> {
  name: string;
  model: string;
  dataset: AsyncIterable<T>;
  task: TaskFn<T>;
  scorers: Record<string, Scorer>;
  reporters: Reporter[];
  store: RunStore;
  suiteId?: string;
  maxConcurrency?: number;
  timeout?: number;
  trials?: number;
  threshold?: number;
}

name

Human-readable name for the evaluation run.
name: 'my-eval'

model

Model identifier (passed to reporters and stored in the database).
model: 'gpt-4o'

dataset

Dataset of input/expected pairs. See Datasets.
import { dataset } from '@deepagents/evals/dataset';

dataset: dataset([
  { input: 'What is 2+2?', expected: '4' },
])

task

Function that calls your model and returns the output.
task: async (item) => {
  const response = await callMyLLM(item.input);
  return {
    output: response,
    usage: { inputTokens: 10, outputTokens: 5 },
  };
}
Type:
type TaskFn<T> = (input: T) => Promise<TaskResult>;

interface TaskResult {
  output: string;
  usage?: { inputTokens: number; outputTokens: number };
}

scorers

Named scoring functions. See Scorers.
import { exactMatch, includes } from '@deepagents/evals/scorers';

scorers: {
  exact: exactMatch,
  contains: includes,
}

reporters

Reporters that receive lifecycle events and produce output.
import { consoleReporter } from '@deepagents/evals/reporters';

reporters: [consoleReporter({ verbosity: 'normal' })]

store

Persistent store for run history.
import { RunStore } from '@deepagents/evals/store';

store: new RunStore('.evals/store.db')

suiteId (optional)

Associate this run with an existing suite ID.
const suite = store.createSuite('text2sql-accuracy');
suiteId: suite.id
If omitted, a new suite is created automatically, named after the name option.
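For example, to group several runs under one suite so their history accumulates together (a sketch; baseOptions is a hypothetical object holding the shared name, dataset, task, scorers, and reporters):
const store = new RunStore('.evals/store.db');
const suite = store.createSuite('text2sql-accuracy');

// Both runs are recorded under the same suite in the store.
// baseOptions (hypothetical) carries the remaining required options.
await evaluate({ ...baseOptions, model: 'gpt-4o', store, suiteId: suite.id });
await evaluate({ ...baseOptions, model: 'gpt-4o-mini', store, suiteId: suite.id });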

maxConcurrency (optional)

Maximum number of cases to run concurrently.
maxConcurrency: 10  // Default: 10

timeout (optional)

Per-case timeout in milliseconds.
timeout: 30_000  // Default: 30000 (30 seconds)

trials (optional)

Number of times to run each case and average the scores.
trials: 3  // Run each case 3 times

threshold (optional)

Minimum average score (0–1) required for a case to pass.
threshold: 0.5  // Default: 0.5
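For example, assuming the average is taken across all of a case's scorer results (and across trials, when set):
// Hypothetical case scored by two scorers:
//   exact:    1.0
//   contains: 0.0
// average = (1.0 + 0.0) / 2 = 0.5, which passes at the default threshold of 0.5
threshold: 0.75  // stricter: the same case would now fail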

EvaluateEachOptions<T, V>

Evaluate multiple model variants:
interface EvaluateEachOptions<T, V extends { name: string }> {
  name: string;
  models: V[];
  dataset: AsyncIterable<T>;
  task: (input: T, variant: V) => Promise<TaskResult>;
  scorers: Record<string, Scorer>;
  reporters: Reporter[];
  store: RunStore;
  maxConcurrency?: number;
  timeout?: number;
  trials?: number;
  threshold?: number;
}

models

Array of model variants. Each variant must have a name property:
models: [
  { name: 'gpt-4o', temperature: 0.7 },
  { name: 'gpt-4o-mini', temperature: 0.7 },
]

task

The task function receives both the input and the current variant:
task: async (input, variant) => {
  const response = await callMyLLM(input.input, variant);
  return { output: response };
}
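Because V only has to carry a name, you can define a richer variant type and receive it, fully typed, in the task. A sketch (MyVariant is illustrative, not part of the library):
interface MyVariant {
  name: string;        // required by the V extends { name: string } constraint
  temperature: number; // any extra per-variant configuration you need
}

const variants: MyVariant[] = [
  { name: 'gpt-4o', temperature: 0.7 },
  { name: 'gpt-4o-mini', temperature: 0.2 },
];

// Then, inside evaluate({ ... }):
//   models: variants,
//   task: async (input, variant) => { ... }  // variant is typed as MyVariant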

Return Value

The evaluate() function returns an EvalBuilder that implements PromiseLike, so you can await it directly. The awaited value is a RunSummary, or a RunSummary[] when evaluating multiple variants with models:
const summary = await evaluate(options);
Or use the builder methods:

failed()

Run only cases that failed in the previous run:
await evaluate(options).failed();
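A typical loop, assuming the previous run's failures are looked up from the configured store:
// First pass: a full run, with per-case results recorded in the store
await evaluate(options);

// After adjusting the task or prompt, re-run only the failures
await evaluate(options).failed();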

cases(spec)

Run specific cases by index:
await evaluate(options).cases('0-10,15,20-25');
Supported formats:
  • 0-10 — Range from 0 to 10 (inclusive)
  • 5 — Single index
  • 0-10,15,20-25 — Multiple ranges and indexes

sample(n)

Run a random sample of n cases:
await evaluate(options).sample(50);

assert()

Throws EvalAssertionError if any case fails:
import { EvalAssertionError } from '@deepagents/evals'; // import path assumed from the main entry point

try {
  await evaluate(options).assert();
} catch (err) {
  if (err instanceof EvalAssertionError) {
    console.error('Eval failed:', err.summary);
  }
}
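In CI you can also let the rejection propagate, or exit explicitly. A Node-specific sketch (process handling is up to your runner):
// Fail the CI job when any case falls below the threshold
evaluate(options).assert().then(
  () => console.log('eval passed'),
  (err) => {
    console.error(err);
    process.exit(1);
  },
);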

Example: Single Model

import { evaluate, dataset, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';

const summary = await evaluate({
  name: 'my-eval',
  model: 'gpt-4o',
  dataset: dataset([
    { input: 'What is 2+2?', expected: '4' },
  ]),
  task: async (item) => {
    const response = await callMyLLM(item.input);
    return { output: response };
  },
  scorers: { exact: exactMatch },
  reporters: [consoleReporter()],
  store: new RunStore(),
});

console.log(summary);

Example: Multiple Models

import { evaluate, dataset, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';

const summaries = await evaluate({
  name: 'model-comparison',
  models: [
    { name: 'gpt-4o' },
    { name: 'gpt-4o-mini' },
  ],
  dataset: dataset([
    { input: 'What is 2+2?', expected: '4' },
  ]),
  task: async (item, variant) => {
    const response = await callMyLLM(item.input, variant.name);
    return { output: response };
  },
  scorers: { exact: exactMatch },
  reporters: [consoleReporter()],
  store: new RunStore(),
});

for (const summary of summaries) {
  console.log(summary);
}

Example: Builder Pattern

// Run only failed cases
await evaluate(options).failed();

// Run specific cases
await evaluate(options).cases('0-10,15');

// Run random sample
await evaluate(options).sample(50);

// Assert no failures
await evaluate(options).assert();

// Chain methods
await evaluate(options).cases('0-10').assert();

Types

RunSummary

interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}
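For example, to derive a pass rate from the fields above:
const summary = await evaluate(options);

const passRate = summary.passCount / summary.totalCases;
console.log(`passed ${summary.passCount}/${summary.totalCases} (${(passRate * 100).toFixed(1)}%)`);
console.log('mean scores:', summary.meanScores);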

EvalAssertionError

class EvalAssertionError extends Error {
  summary: RunSummary | RunSummary[];
}
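Since summary is a union (a single run, or one run per variant when using models), narrow it before reading run fields:
try {
  await evaluate(options).assert();
} catch (err) {
  if (err instanceof EvalAssertionError) {
    // summary may be a single RunSummary or an array of them
    const runs = Array.isArray(err.summary) ? err.summary : [err.summary];
    for (const run of runs) {
      console.error(`${run.failCount}/${run.totalCases} cases failed`);
    }
  }
}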

Next Steps

• Datasets: Learn about dataset loading
• Scorers: Explore scorer functions
• Engine API: Lower-level engine API
