
Overview

The @deepagents/evals package provides a comprehensive evaluation framework for LLM systems. It includes dataset loading, deterministic and LLM-based scorers, run persistence, model comparison, and multiple output formats.

Installation

npm install @deepagents/evals

Quick Example

import { dataset, evaluate, exactMatch } from '@deepagents/evals';
import { agent, generate } from '@deepagents/agent';
import { openai } from '@ai-sdk/openai';

const mathAgent = agent({
  name: 'math',
  model: openai('gpt-4o'),
  prompt: 'Solve math problems. Return only the numeric answer.',
});

const summary = await evaluate({
  name: 'math-accuracy',
  model: 'gpt-4o',
  dataset: dataset([
    { input: 'What is 2+2?', expected: '4' },
    { input: 'What is 10*5?', expected: '50' },
    { input: 'What is 100/4?', expected: '25' },
  ]),
  task: async (item) => {
    const result = await generate(mathAgent, item.input, {});
    return { output: (await result.text).trim() };
  },
  scorers: { exact: exactMatch },
});

console.log('Accuracy:', summary.scorerMeans.exact); // 1.0 if every answer matches
console.log('Pass rate:', summary.passRate);         // 1.0

Core API

evaluate(config)

Run an evaluation over a dataset with one or more scorers:
interface EvaluateOptions {
  name: string;                    // Eval name
  model: string;                   // Model identifier
  dataset: Dataset;                // Test cases
  task: TaskFn;                    // Function to test
  scorers: Record<string, Scorer>; // Scoring functions
  maxConcurrency?: number;         // Parallel executions
  timeout?: number;                // Per-case timeout (ms)
  threshold?: number;              // Pass threshold (0-1)
  store?: RunStore;                // Persistence
}

const summary = await evaluate({
  name: 'my-eval',
  model: 'gpt-4o',
  dataset: myDataset,
  task: myTask,
  scorers: { accuracy: exactMatch },
  maxConcurrency: 10,
  threshold: 0.8,
});
Returns:
interface RunSummary {
  runId: string;
  totalCases: number;
  scorerMeans: Record<string, number>;
  passRate: number;
  avgLatencyMs: number;
  errors: number;
}
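The summary fields are aggregates over the individual cases. A standalone sketch of how a per-scorer mean and the pass rate might be derived from case scores and a threshold (assumed semantics, for illustration only):

```typescript
// Sketch of summary aggregation (assumed semantics): mean score and
// the fraction of cases meeting the pass threshold.
function summarize(caseScores: number[], threshold: number) {
  const mean = caseScores.reduce((a, b) => a + b, 0) / caseScores.length;
  const passRate =
    caseScores.filter((s) => s >= threshold).length / caseScores.length;
  return { mean, passRate };
}

summarize([1, 0.6, 0.9, 0.4], 0.7); // → { mean: 0.725, passRate: 0.5 }
```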

Datasets

dataset(source)

Load data from an inline array or a file path such as a JSONL file:
import { dataset } from '@deepagents/evals';

const ds = dataset([
  { input: 'hello', expected: 'world' },
  { input: 'foo', expected: 'bar' },
]);
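A JSONL source is simply one JSON object per line. A standalone sketch of that parsing (`parseJsonl` is illustrative, not a package export):

```typescript
// Sketch of JSONL loading: one JSON object per non-empty line.
// parseJsonl is illustrative, not a package export.
interface EvalItem {
  input: string;
  expected: string;
}

function parseJsonl(text: string): EvalItem[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as EvalItem);
}

const items = parseJsonl(
  '{"input":"hello","expected":"world"}\n{"input":"foo","expected":"bar"}'
);
// items[1].expected === 'bar'
```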

Dataset Transforms

Chainable, lazy transformations:
const ds = dataset('./large-dataset.jsonl')
  .filter((row) => row.difficulty === 'hard')
  .map((row) => ({
    input: row.question,
    expected: row.answer,
  }))
  .shuffle()
  .limit(100);
Available Transforms:
| Transform   | Type  | Description                   |
| ----------- | ----- | ----------------------------- |
| map(fn)     | Lazy  | Transform each element        |
| filter(fn)  | Lazy  | Exclude non-matching elements |
| limit(n)    | Lazy  | Cap at n elements             |
| shuffle()   | Eager | Randomize order               |
| sample(n)   | Eager | Pick n random elements        |
| toArray()   | -     | Consume to an array           |
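Lazy transforms compose without materializing intermediate arrays, much like generator pipelines. A conceptual, standalone sketch of that behavior (not the package's implementation):

```typescript
// Conceptual sketch of lazy map/filter/limit over an iterable,
// illustrating why chained transforms avoid intermediate arrays.
function* lazyFilter<T>(src: Iterable<T>, pred: (x: T) => boolean) {
  for (const x of src) if (pred(x)) yield x;
}
function* lazyMap<T, U>(src: Iterable<T>, fn: (x: T) => U) {
  for (const x of src) yield fn(x);
}
function* lazyLimit<T>(src: Iterable<T>, n: number) {
  let i = 0;
  for (const x of src) {
    if (i++ >= n) return;
    yield x;
  }
}

const rows = [1, 2, 3, 4, 5, 6];
const result = [
  ...lazyLimit(lazyMap(lazyFilter(rows, (x) => x % 2 === 0), (x) => x * 10), 2),
];
// → [20, 40]
```

Only the elements actually consumed by `limit` flow through the earlier stages, which is why filtering a large JSONL dataset before `limit(100)` stays cheap.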

Scorers

Deterministic Scorers

import { exactMatch } from '@deepagents/evals/scorers';

const scorer = exactMatch;
// Strict string equality
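For reference, a levenshtein-style scorer is a normalized edit distance: 1 for identical strings, approaching 0 as they diverge. The standalone sketch below is illustrative only (the package ships its own `levenshtein` scorer):

```typescript
// Standalone sketch of a levenshtein-style score (illustrative; the
// package ships its own `levenshtein` scorer).
function editDistance(a: string, b: string): number {
  // Classic dynamic-programming edit distance.
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Score in [0, 1]: 1 means identical strings.
function levenshteinScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length) || 1;
  return 1 - editDistance(output, expected) / maxLen;
}

levenshteinScore('kitten', 'kitten'); // → 1
```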

LLM-Based Scorers

import { factuality } from '@deepagents/evals/scorers';

const scorer = factuality({
  model: 'gpt-4o-mini', // OpenAI-compatible model
});

const summary = await evaluate({
  // ...
  scorers: {
    exact: exactMatch,
    factual: scorer,
  },
});
factuality uses the autoevals library and requires an OpenAI-compatible model.

Scorer Combinators

all() - Minimum Score

import { all, exactMatch, includes } from '@deepagents/evals/scorers';

const strict = all(exactMatch, includes);
// Returns lowest score from all scorers

any() - Maximum Score

import { any, exactMatch, includes } from '@deepagents/evals/scorers';

const lenient = any(exactMatch, includes);
// Returns highest score from all scorers

weighted() - Weighted Average

import { weighted, exactMatch, factuality } from '@deepagents/evals/scorers';

const balanced = weighted({
  accuracy: { scorer: exactMatch, weight: 2 },
  grounding: { scorer: factuality({ model: 'gpt-4o-mini' }), weight: 1 },
});
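With the weights above, the combined score works out to (2·accuracy + 1·grounding) / 3. A minimal sketch of that weighted-mean arithmetic (assumed semantics; `weightedMean` is illustrative, not a package export):

```typescript
// Sketch of the weighted mean a weighted() combinator presumably computes.
function weightedMean(
  scores: Record<string, { score: number; weight: number }>
): number {
  let total = 0;
  let weightSum = 0;
  for (const { score, weight } of Object.values(scores)) {
    total += score * weight;
    weightSum += weight;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

weightedMean({
  accuracy: { score: 1.0, weight: 2 },
  grounding: { score: 0.4, weight: 1 },
}); // → 0.8
```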

Custom Scorers

Create your own:
import type { Scorer } from '@deepagents/evals';

const customScorer: Scorer = async ({ output, expected }) => {
  const score = computeSimilarity(output, expected);
  return {
    score: score, // 0-1
    reason: `Similarity: ${score.toFixed(2)}`,
  };
};

const summary = await evaluate({
  // ...
  scorers: { custom: customScorer },
});

Run Store

Persist evaluation runs to SQLite:
import { RunStore } from '@deepagents/evals/store';

const store = new RunStore('.evals/store.db');

// Create suite for grouping
const suite = store.createSuite('text2sql-accuracy');

// Run evaluation
const summary = await evaluate({
  // ...
  store,
});

// Query results
const runs = store.listRuns(suite.id);
const failing = store.getFailingCases(summary.runId, 0.5);
const runSummary = store.getRunSummary(summary.runId);

// Close
store.close();
Storage Schema:
  • suites - Evaluation suites
  • runs - Individual runs
  • cases - Test cases
  • scores - Scoring results

Engine

Low-level evaluation engine:
import { EvalEmitter, runEval } from '@deepagents/evals/engine';

const emitter = new EvalEmitter();

emitter.on('run:start', (data) => console.log('Started:', data.name));
emitter.on('case:scored', (data) => console.log('Case', data.index, ':', data.scores));
emitter.on('run:end', (data) => console.log('Completed:', data.summary));

const summary = await runEval({
  name: 'my-eval',
  model: 'gpt-4o',
  dataset: myDataset,
  task: myTask,
  scorers: { exact: exactMatch },
  emitter,
  store,
  maxConcurrency: 10,
  timeout: 30_000,
});
Events:
| Event       | Payload                              | When          |
| ----------- | ------------------------------------ | ------------- |
| run:start   | { runId, totalCases, name, model }   | Run begins    |
| case:start  | { runId, index, input }              | Case starts   |
| case:scored | { runId, index, scores, latencyMs }  | Case scored   |
| case:error  | { runId, index, error }              | Task error    |
| run:end     | { runId, summary }                   | Run completes |
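These events make it straightforward to build custom reporters. A minimal sketch that formats a case:scored payload into a log line (the payload shape follows the table above; the formatter itself is illustrative and would be attached with `emitter.on('case:scored', ...)`):

```typescript
// Minimal custom-reporter sketch: turn a case:scored payload into a
// log line. Payload shape follows the events table above.
interface CaseScoredEvent {
  runId: string;
  index: number;
  scores: Record<string, number>;
  latencyMs: number;
}

function formatCaseScored(e: CaseScoredEvent): string {
  const scores = Object.entries(e.scores)
    .map(([name, value]) => `${name}=${value.toFixed(2)}`)
    .join(' ');
  return `case ${e.index}: ${scores} (${e.latencyMs}ms)`;
}

formatCaseScored({ runId: 'r1', index: 3, scores: { exact: 1 }, latencyMs: 812 });
// → 'case 3: exact=1.00 (812ms)'
```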

Comparison

Compare two evaluation runs:
import { compareRuns } from '@deepagents/evals/comparison';

const result = compareRuns(
  store,
  baselineRunId,
  candidateRunId,
  {
    tolerance: 0.01,         // Ignore changes < 1%
    regressionThreshold: 0.05, // Flag drops > 5%
  }
);

console.log('Regressed:', result.regression.regressed);
console.log('Scorer deltas:', result.scorerSummaries);
console.log('Latency delta:', result.costDelta.avgLatencyMs);
Returns:
interface ComparisonResult {
  regression: {
    regressed: boolean;
    regressedScorers: string[];
  };
  scorerSummaries: Record<string, ScorerSummary>;
  costDelta: {
    avgLatencyMs: number;
  };
  caseDiffs: CaseDiff[];
}
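The regression check presumably reduces to comparing per-scorer means against the two thresholds. A standalone sketch of that logic (assumed semantics; the package's exact rules may differ):

```typescript
// Sketch of regression detection: a scorer regresses when its mean
// drops by more than regressionThreshold; deltas under tolerance are
// treated as noise and ignored.
function findRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  tolerance: number,
  regressionThreshold: number
): string[] {
  const regressed: string[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const delta = (candidate[name] ?? 0) - base;
    if (Math.abs(delta) < tolerance) continue; // within noise
    if (delta < -regressionThreshold) regressed.push(name);
  }
  return regressed;
}

findRegressions(
  { exact: 0.9, fuzzy: 0.8 },
  { exact: 0.82, fuzzy: 0.805 },
  0.01,
  0.05
); // → ['exact']
```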

Reporters

Output results in various formats.

Console Reporter

import { consoleReporter } from '@deepagents/evals/reporters';

const emitter = new EvalEmitter();
consoleReporter(emitter, {
  verbosity: 'normal', // 'quiet' | 'normal' | 'verbose'
  threshold: 0.5,
});

const summary = await runEval({
  // ...
  emitter,
});

File Reporters

import { jsonReporter } from '@deepagents/evals/reporters';

jsonReporter(emitter, './results.json');

Complete Example

import { dataset, evaluate, exactMatch, levenshtein, weighted } from '@deepagents/evals';
import { RunStore } from '@deepagents/evals/store';
import { agent, generate } from '@deepagents/agent';
import { openai } from '@ai-sdk/openai';

// Setup
const store = new RunStore('.evals/runs.db');
const suite = store.createSuite('summarization');

const summarizer = agent({
  name: 'summarizer',
  model: openai('gpt-4o'),
  prompt: 'Summarize text concisely.',
});

// Run evaluation
const summary = await evaluate({
  name: 'summarization-quality',
  model: 'gpt-4o',
  dataset: dataset('./data/summaries.jsonl')
    .filter((r) => r.input.length < 1000)
    .limit(50),
  task: async (item) => {
    const result = await generate(summarizer, item.input, {});
    return { output: await result.text };
  },
  scorers: {
    similarity: weighted({
      exact: { scorer: exactMatch, weight: 1 },
      fuzzy: { scorer: levenshtein, weight: 2 },
    }),
  },
  maxConcurrency: 5,
  timeout: 10_000,
  threshold: 0.7,
  store,
});

console.log('Pass rate:', summary.passRate);
console.log('Avg latency:', summary.avgLatencyMs, 'ms');

store.close();

Related Packages

  • @deepagents/agent - Evaluate agent performance
  • @deepagents/text2sql - Evaluate SQL accuracy