Scorers API

Scorers evaluate the quality of LLM outputs. All scorers implement the Scorer type.

Import

import {
  exactMatch,
  includes,
  regex,
  levenshtein,
  jsonMatch,
  factuality,
  all,
  any,
  weighted,
} from '@deepagents/evals/scorers';

Types

`Scorer`

type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

`ScorerArgs`

interface ScorerArgs {
  input: unknown;     // Original input from dataset
  output: string;     // Model output to score
  expected?: unknown; // Expected value from dataset
}

`ScorerResult`

interface ScorerResult {
  score: number;                      // 0..1 (0 = worst, 1 = best)
  reason?: string;                    // Human-readable explanation
  metadata?: Record<string, unknown>; // Additional scoring metadata
}

Deterministic Scorers

`exactMatch`

Strict string equality:

const result = await exactMatch({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0 }

Returns:

1.0 if output === String(expected)
0.0 otherwise, with a reason explaining the mismatch
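
The comparison rule above can be sketched as follows. This is a minimal illustration, not the library's source: `exactMatchSketch` and the wording of the `reason` string are hypothetical, and the types are local copies of the ones documented under Types.

```typescript
// Local copies of the documented types so the sketch runs on its own.
interface ScorerArgs { input: unknown; output: string; expected?: unknown }
interface ScorerResult { score: number; reason?: string; metadata?: Record<string, unknown> }
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

// Hypothetical stand-in for exactMatch: strict equality against String(expected).
const exactMatchSketch: Scorer = async ({ output, expected }) => {
  const target = String(expected);
  if (output === target) return { score: 1.0 };
  // The reason text is illustrative; the library's wording may differ.
  return { score: 0.0, reason: `Expected "${target}" but got "${output}"` };
};
```

Because the expected value goes through String(), a numeric `expected: 4` matches the output '4', while an output with trailing whitespace such as '4 ' does not.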

`includes`

Substring check:

const result = await includes({
  input: 'What is the capital of France?',
  output: 'The capital of France is Paris.',
  expected: 'Paris',
});
// { score: 1.0 }

Returns:

1.0 if output.includes(String(expected))
0.0 otherwise

`regex(pattern)`

Regular expression test:

const emailScorer = regex(/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$/i);

const result = await emailScorer({
  input: 'Extract the email',
  output: 'user@example.com',
});
// { score: 1.0 }

Signature:

function regex(pattern: RegExp): Scorer;

Returns:

1.0 if pattern.test(output)
0.0 otherwise
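
One detail worth keeping in mind when passing a pattern: in JavaScript, `RegExp.prototype.test` advances `lastIndex` on patterns with the `g` or `y` flag, so the same pattern object can give different answers on consecutive calls. The sketch below is a hypothetical factory (`regexSketch` is not the library's code) that resets `lastIndex` defensively; whether the library does the same is not stated here, so avoiding the `g` flag in scorer patterns is the safer choice.

```typescript
// Local copies of the documented types so the sketch runs on its own.
interface ScorerArgs { input: unknown; output: string; expected?: unknown }
interface ScorerResult { score: number; reason?: string; metadata?: Record<string, unknown> }
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

// Hypothetical factory mirroring the documented regex(pattern) behavior.
function regexSketch(pattern: RegExp): Scorer {
  return async ({ output }) => {
    // With the g or y flag, test() resumes from lastIndex, so reset it
    // to keep each scoring call independent of the previous one.
    pattern.lastIndex = 0;
    return { score: pattern.test(output) ? 1.0 : 0.0 };
  };
}
```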

`levenshtein`

Normalized edit distance similarity:

const result = await levenshtein({
  input: 'Spell "hello"',
  output: 'helo',
  expected: 'hello',
});
// { score: 0.8, reason: '...', metadata: { ... } }

Returns:

1.0 for exact match
0.0 for completely different strings
Decimal between 0 and 1 for partial similarity
Includes reason and metadata from autoevals
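
The 0.8 in the example above is consistent with normalizing the edit distance by the length of the longer string: 'helo' and 'hello' differ by one insertion, and 1 - 1/5 = 0.8. The sketch below works through that arithmetic. The normalization is an assumption of this sketch, and `editDistance` and `similarity` are hypothetical helpers; the library delegates the real computation to autoevals.

```typescript
// Classic dynamic-programming edit distance (insertions, deletions, substitutions).
function editDistance(a: string, b: string): number {
  // prev is the previous row of the DP table: at iteration i, prev[j] is the
  // distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 - distance / length of the longer string.
function similarity(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  return maxLen === 0 ? 1 : 1 - editDistance(output, expected) / maxLen;
}
```

Under this normalization, identical strings score 1 and strings of equal length with no characters in common score 0.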

`jsonMatch`

Deep structural equality for JSON:

const result = await jsonMatch({
  input: 'Generate JSON',
  output: '{"name":"Alice","age":30}',
  expected: { name: 'Alice', age: 30 },
});
// { score: 1.0 }

Returns:

1.0 if JSON structures are deeply equal
0.0 if structures differ or JSON is invalid

Notes:

Object key order doesn’t matter
Array order matters
expected can be a string or an object
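
The notes above can be made concrete with a small deep-equality sketch. `deepEqual` and `jsonMatchScore` are hypothetical helpers, not the library's implementation, and parsing a string `expected` as JSON is an assumption of this sketch.

```typescript
// Hypothetical deep-equality check following the notes above.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  if (Array.isArray(a) && Array.isArray(b)) {
    // Arrays are compared position by position, so order matters.
    return a.length === b.length && a.every((item, i) => deepEqual(item, b[i]));
  }
  // Objects are compared by key lookup, so key order does not matter.
  const objA = a as Record<string, unknown>;
  const objB = b as Record<string, unknown>;
  const keysA = Object.keys(objA);
  return (
    keysA.length === Object.keys(objB).length &&
    keysA.every((key) => key in objB && deepEqual(objA[key], objB[key]))
  );
}

function jsonMatchScore(output: string, expected: unknown): number {
  try {
    const actual: unknown = JSON.parse(output);
    // Assumption: a string expected value is itself JSON text.
    const target: unknown = typeof expected === 'string' ? JSON.parse(expected) : expected;
    return deepEqual(actual, target) ? 1.0 : 0.0;
  } catch {
    return 0.0; // invalid JSON scores 0
  }
}
```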

LLM-Based Scorers

`factuality(config)`

Uses an LLM to judge whether the output is factually consistent with the expected value:

const factScorer = factuality({ model: 'gpt-4o-mini' });

const result = await factScorer({
  input: 'What is the capital of France?',
  output: 'Paris is the capital and largest city of France.',
  expected: 'Paris',
});
// { score: 1.0, reason: 'Output is factually correct', metadata: { ... } }

Signature:

function factuality(config: { model: string }): Scorer;

Config:

{
  model: string; // OpenAI-compatible model ID (e.g., 'gpt-4o-mini')
}

Returns:

1.0 if output is factually consistent with expected
0.0 if output contradicts expected
Decimal between 0 and 1 for partial correctness
reason field contains LLM’s explanation
metadata includes additional details from autoevals

Requirements:

OPENAI_API_KEY environment variable
OpenAI-compatible API endpoint

Combinators

`all(...scorers)`

Weakest-link (minimum score):

const strict = all(exactMatch, includes);

const result = await strict({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0 } (both scorers passed)

Signature:

function all(...scorers: Scorer[]): Scorer;

Returns:

score: Minimum score of all scorers
reason: Concatenated reasons from all scorers (semicolon-separated)
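
A minimal sketch of the weakest-link rule, assuming the scorers can run concurrently and that results without a reason are skipped when joining. `allSketch` is a hypothetical stand-in, not the library's code, and it does not handle being called with no scorers.

```typescript
// Local copies of the documented types so the sketch runs on its own.
interface ScorerArgs { input: unknown; output: string; expected?: unknown }
interface ScorerResult { score: number; reason?: string; metadata?: Record<string, unknown> }
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

// Hypothetical stand-in for all(...scorers).
function allSketch(...scorers: Scorer[]): Scorer {
  return async (args) => {
    // Run every scorer on the same arguments.
    const results = await Promise.all(scorers.map((scorer) => scorer(args)));
    // The combined score is the lowest individual score.
    const score = Math.min(...results.map((r) => r.score));
    // Join the reasons that are present with "; ".
    const reasons = results
      .map((r) => r.reason)
      .filter((reason): reason is string => Boolean(reason));
    return reasons.length > 0 ? { score, reason: reasons.join('; ') } : { score };
  };
}
```

With three scorers returning 1, 0.5 (reason 'partial'), and 0 (reason 'mismatch'), the combined result is a score of 0 with the reason 'partial; mismatch'.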

`any(...scorers)`

Best-of (maximum score):

const lenient = any(exactMatch, includes);

const result = await lenient({
  input: 'What is the capital of France?',
  output: 'The capital is Paris.',
  expected: 'Paris',
});
// { score: 1.0 } (includes passed, even though exactMatch failed)

Signature:

function any(...scorers: Scorer[]): Scorer;

Returns:

score: Maximum score of all scorers
reason: Reason from the highest-scoring scorer
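
The best-of rule can be sketched the same way. `anySketch` is hypothetical; returning the whole winning result and breaking ties in favor of the earlier scorer are choices of this sketch, and an empty scorer list is not handled.

```typescript
// Local copies of the documented types so the sketch runs on its own.
interface ScorerArgs { input: unknown; output: string; expected?: unknown }
interface ScorerResult { score: number; reason?: string; metadata?: Record<string, unknown> }
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

// Hypothetical stand-in for any(...scorers).
function anySketch(...scorers: Scorer[]): Scorer {
  return async (args) => {
    const results = await Promise.all(scorers.map((scorer) => scorer(args)));
    // Keep the first result with the highest score, so the reason
    // comes from the highest-scoring scorer.
    return results.reduce((best, r) => (r.score > best.score ? r : best));
  };
}
```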

`weighted(config)`

Weighted average:

const balanced = weighted({
  accuracy: { scorer: exactMatch, weight: 2 },
  grounding: { scorer: factuality({ model: 'gpt-4o-mini' }), weight: 1 },
});

const result = await balanced({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0, reason: 'accuracy: 1.00 (w=2), grounding: 1.00 (w=1)' }

Signature:

function weighted(
  config: Record<string, { scorer: Scorer; weight: number }>
): Scorer;

Config:

{
  [name: string]: {
    scorer: Scorer;
    weight: number;
  }
}

Returns:

score: Weighted average, computed as sum(score * weight) / sum(weight)
reason: Lists all scorer scores and weights
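
To make the formula concrete, here is a worked sketch. `weightedSketch` is a hypothetical stand-in, and its reason format simply imitates the example output shown earlier; zero or negative total weights are not handled.

```typescript
// Local copies of the documented types so the sketch runs on its own.
interface ScorerArgs { input: unknown; output: string; expected?: unknown }
interface ScorerResult { score: number; reason?: string; metadata?: Record<string, unknown> }
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;
type WeightedConfig = Record<string, { scorer: Scorer; weight: number }>;

// Hypothetical stand-in for weighted(config).
function weightedSketch(config: WeightedConfig): Scorer {
  return async (args) => {
    const entries = Object.entries(config);
    const results = await Promise.all(entries.map(([, { scorer }]) => scorer(args)));
    let weightedSum = 0;
    let totalWeight = 0;
    const parts: string[] = [];
    entries.forEach(([name, { weight }], i) => {
      weightedSum += results[i].score * weight; // score * weight
      totalWeight += weight;                    // sum(weight)
      parts.push(`${name}: ${results[i].score.toFixed(2)} (w=${weight})`);
    });
    return { score: weightedSum / totalWeight, reason: parts.join(', ') };
  };
}
```

For example, a scorer with weight 3 that returns 1.0 and a scorer with weight 1 that returns 0.0 combine to (1.0 * 3 + 0.0 * 1) / (3 + 1) = 0.75.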

Custom Scorers

Create custom scorers by implementing the Scorer type:

import type { Scorer } from '@deepagents/evals/scorers';

const lengthScorer: Scorer = async ({ output }) => {
  const score = output.length > 10 ? 1.0 : 0.5;
  return {
    score,
    reason: `Output length: ${output.length}`,
  };
};

Requirements:

Return a Promise<ScorerResult>
Score must be between 0 and 1
Optionally include reason and metadata
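
One way to guard the score range in a custom scorer is a small wrapper that clamps whatever the inner scorer returns. `clamped` is a hypothetical helper, not part of the package; whether the library itself validates or clamps scores is not stated here.

```typescript
// Local copies of the documented types so the sketch runs on its own.
interface ScorerArgs { input: unknown; output: string; expected?: unknown }
interface ScorerResult { score: number; reason?: string; metadata?: Record<string, unknown> }
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

// Hypothetical wrapper that forces any scorer's result into the 0..1 range.
function clamped(scorer: Scorer): Scorer {
  return async (args) => {
    const result = await scorer(args);
    // NaN propagates through Math.min/Math.max, so map it to 0 explicitly.
    const score = Number.isNaN(result.score)
      ? 0
      : Math.min(1, Math.max(0, result.score));
    return { ...result, score };
  };
}
```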

Examples

Using Multiple Scorers

import { evaluate, exactMatch, includes } from '@deepagents/evals';

await evaluate({
  // ...
  scorers: {
    exact: exactMatch,
    contains: includes,
  },
});

A case passes only when every scorer returns a score greater than or equal to the threshold.
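
The pass rule can be read as a simple predicate over the per-scorer scores. `casePasses` is a hypothetical helper for illustration, and the threshold values in the usage note are examples only; how the threshold is configured is covered by the Evaluate API rather than this page.

```typescript
// Hypothetical predicate: every named score must reach the threshold.
function casePasses(scores: Record<string, number>, threshold: number): boolean {
  return Object.values(scores).every((score) => score >= threshold);
}
```

With scores `{ exact: 0, contains: 1 }` the case fails at a threshold of 0.5 because `exact` falls short, while `{ exact: 1, contains: 1 }` passes.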

Combining Scorers

import { all, any, weighted, exactMatch, includes, factuality } from '@deepagents/evals/scorers';

// All must pass
const strict = all(exactMatch, includes);

// At least one must pass
const lenient = any(exactMatch, includes);

// Weighted combination
const balanced = weighted({
  accuracy: { scorer: exactMatch, weight: 2 },
  grounding: { scorer: factuality({ model: 'gpt-4o-mini' }), weight: 1 },
});

Custom Scorer

import type { Scorer } from '@deepagents/evals/scorers';

const containsKeyword: Scorer = async ({ output }) => {
  const keywords = ['paris', 'france', 'capital'];
  const matches = keywords.filter((k) => output.toLowerCase().includes(k));
  return {
    score: matches.length / keywords.length,
    reason: `Matched ${matches.length}/${keywords.length} keywords`,
  };
};

Next Steps

Evaluate API

Scorers Guide
