Quickstart

This guide walks you through running your first LLM evaluation with @deepagents/evals.

Basic Example

Here’s a complete evaluation that tests a simple question-answering task:
import { dataset, evaluate, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';

// Define your task function
const task = async (item: { input: string; expected: string }) => {
  // Call your LLM here
  const response = await callMyLLM(item.input);
  return { output: response };
};

// Run the evaluation
const summary = await evaluate({
  name: 'my-first-eval',
  model: 'gpt-4o',
  dataset: dataset([
    { input: 'What is 2+2?', expected: '4' },
    { input: 'What is 3+3?', expected: '6' },
  ]),
  task,
  scorers: { exact: exactMatch },
  reporters: [consoleReporter()],
  store: new RunStore(),
});

console.log(summary);

Step-by-Step Breakdown

1. Import Dependencies

import { dataset, evaluate, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';

2. Create a Dataset

The dataset is an array of test cases. Each case should have an input and an expected value:
const ds = dataset([
  { input: 'What is 2+2?', expected: '4' },
  { input: 'What is 3+3?', expected: '6' },
  { input: 'What is the capital of France?', expected: 'Paris' },
]);
You can also load from files:
const ds = dataset('./questions.json');
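The file formats `dataset()` accepts beyond JSON aren't covered here. If you keep cases in JSONL, one option is to parse the file yourself and hand the resulting array to `dataset()`. `parseJsonl` and `EvalCase` below are illustrative names, not part of `@deepagents/evals`:

```typescript
// Hypothetical helper: parse JSONL text into eval cases.
// Each non-empty line is expected to be one JSON object.
interface EvalCase {
  input: string;
  expected: string;
}

function parseJsonl(text: string): EvalCase[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as EvalCase);
}

// Usage: read the file yourself, then pass the array to dataset():
// import { readFileSync } from 'node:fs';
// const ds = dataset(parseJsonl(readFileSync('./questions.jsonl', 'utf8')));
```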

3. Define a Task Function

The task function calls your LLM and returns the output:
const task = async (item: { input: string; expected: string }) => {
  // Example using OpenAI
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: item.input }],
  });

  return {
    output: completion.choices[0]!.message.content!,
    usage: {
      inputTokens: completion.usage!.prompt_tokens,
      outputTokens: completion.usage!.completion_tokens,
    },
  };
};
The task function should return { output: string, usage?: { inputTokens: number, outputTokens: number } }.
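Before wiring in a real model, it can help to verify the harness with a deterministic stand-in that returns the same shape. `echoTask` below is a hypothetical stub, not part of the library:

```typescript
// The return shape evaluate() expects from a task function.
type TaskResult = {
  output: string;
  usage?: { inputTokens: number; outputTokens: number };
};

// Deterministic stand-in task: echoes the expected answer so every
// case passes. The usage numbers are placeholders, not real token counts.
const echoTask = async (item: { input: string; expected: string }): Promise<TaskResult> => {
  return {
    output: item.expected,
    usage: { inputTokens: item.input.length, outputTokens: item.expected.length },
  };
};
```

Once the pipeline runs end to end with the stub, swap in the real LLM call.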

4. Choose Scorers

Scorers evaluate the quality of the output. Use built-in scorers or create custom ones:
import { exactMatch, includes } from '@deepagents/evals/scorers';

const scorers = {
  exact: exactMatch,
  contains: includes,
};
See the Scorers guide for all available options.
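As a sketch of a custom scorer, assuming a scorer receives the task output plus the expected value and returns a number in [0, 1] (the signature here is an assumption; check the Scorers guide for the library's actual interface), a tolerance-based numeric scorer might look like this:

```typescript
// Hypothetical custom scorer: full credit when the output parses to a
// number within 1% relative error of the expected value, otherwise zero.
const numericClose = (args: { output: string; expected: string }): number => {
  const got = parseFloat(args.output);
  const want = parseFloat(args.expected);
  if (Number.isNaN(got) || Number.isNaN(want)) return 0;
  return Math.abs(got - want) <= Math.abs(want) * 0.01 ? 1 : 0;
};

// const scorers = { exact: exactMatch, numeric: numericClose };
```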

5. Set Up Reporters

Reporters display evaluation results:
import { consoleReporter } from '@deepagents/evals/reporters';

const reporters = [
  consoleReporter({ verbosity: 'normal' }),
];

6. Create a Run Store

The run store persists results to SQLite:
import { RunStore } from '@deepagents/evals/store';

const store = new RunStore('.evals/store.db');

7. Run the Evaluation

Put it all together with evaluate():
const summary = await evaluate({
  name: 'my-first-eval',
  model: 'gpt-4o',
  dataset: ds,
  task,
  scorers,
  reporters,
  store,
});

console.log(summary);

Understanding the Output

The console reporter displays a progress bar and summary:
  ── my-first-eval ──────────────────────────────────────────
  Running 3 cases...

  Summary
  ────────────────────────────────────────────────────────────
  Eval:      my-first-eval
  Model:     gpt-4o
  Threshold: 0.5
  Cases:     3
  Pass/Fail: 2 / 1 (66.7%)
  Duration:  1.2s
  Tokens:    In: 45  Out: 12  Total: 57

  Scores
    exact:      0.667
  ────────────────────────────────────────────────────────────
The summary object contains detailed metrics:
interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}
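Because the summary is a plain object, you can gate a CI job on it. For example, a small helper (hypothetical, using only the `RunSummary` fields above) that computes the pass rate:

```typescript
interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}

// Fraction of cases that passed; 0 when the run was empty.
function passRate(summary: RunSummary): number {
  return summary.totalCases === 0 ? 0 : summary.passCount / summary.totalCases;
}

// e.g. in CI: if (passRate(summary) < 0.9) process.exit(1);
```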

Advanced Configuration

Concurrency

Control how many cases run in parallel:
await evaluate({
  // ...
  maxConcurrency: 5, // Run 5 cases at a time
});

Timeout

Set a per-case timeout:
await evaluate({
  // ...
  timeout: 30_000, // 30 seconds per case
});

Threshold

Set the minimum passing score:
await evaluate({
  // ...
  threshold: 0.8, // Require 80% score to pass
});

Trials

Run each case multiple times to reduce variance:
await evaluate({
  // ...
  trials: 3, // Run each case 3 times, average the scores
});
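These options can be combined in a single call; a sketch reusing the names from the earlier steps:

```typescript
const summary = await evaluate({
  name: 'my-first-eval',
  model: 'gpt-4o',
  dataset: ds,
  task,
  scorers,
  reporters,
  store,
  maxConcurrency: 5, // 5 cases in parallel
  timeout: 30_000,   // 30 seconds per case
  threshold: 0.8,    // a case passes only at a score of 0.8 or higher
  trials: 3,         // run each case 3 times, average the scores
});
```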

Builder Pattern

The evaluate() function returns an EvalBuilder: awaiting it runs the evaluation, and you can chain modifiers before awaiting:
// Run only failed cases from the previous run
await evaluate(options).failed();

// Run specific cases
await evaluate(options).cases('0-10,15,20-25');

// Run a random sample
await evaluate(options).sample(50);

// Throw if any cases fail
await evaluate(options).assert();

What’s Next?

Datasets

Learn about dataset loading and transforms

Scorers

Explore all built-in scorers

Comparison

Compare runs and detect regressions

API Reference

Full API documentation
