Scorers

Scorers evaluate the quality of LLM outputs. All scorers return a ScorerResult:
interface ScorerResult {
  score: number;  // 0..1 (0 = worst, 1 = best)
  reason?: string;
  metadata?: Record<string, unknown>;
}

Deterministic Scorers

These scorers use rule-based logic and don’t require LLM calls.

exactMatch

Strict string equality between output and expected:
import { exactMatch } from '@deepagents/evals/scorers';

const result = await exactMatch({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0 }
Returns:
  • 1.0 if output exactly matches expected
  • 0.0 otherwise, with a reason explaining the mismatch

includes

Substring check — passes if output contains expected:
import { includes } from '@deepagents/evals/scorers';

const result = await includes({
  input: 'What is the capital of France?',
  output: 'The capital of France is Paris.',
  expected: 'Paris',
});
// { score: 1.0 }
Returns:
  • 1.0 if output includes expected as a substring
  • 0.0 otherwise

regex(pattern)

Regular expression test:
import { regex } from '@deepagents/evals/scorers';

const emailScorer = regex(/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$/i);

const result = await emailScorer({
  input: 'Extract the email',
  output: 'user@example.com',
});
// { score: 1.0 }
Returns:
  • 1.0 if pattern matches
  • 0.0 otherwise
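Note that unlike exactMatch and includes, regex is a factory: you call it with a pattern and get back a scorer closed over that pattern. The shape of that factory can be sketched as below. This is an illustrative, simplified synchronous sketch, not the library's implementation (real scorers are async and take the full argument object):

```typescript
// Sketch of the factory pattern behind regex(pattern): calling it with a
// RegExp returns a scorer function that tests output against that pattern.
// Simplified to a synchronous signature for illustration only.
type SimpleResult = { score: number; reason?: string };

function regexSketch(pattern: RegExp) {
  return ({ output }: { output: string }): SimpleResult =>
    pattern.test(output)
      ? { score: 1 }
      : { score: 0, reason: `output does not match ${pattern}` };
}

// Reusing the email pattern from the example above:
const emailLike = regexSketch(/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$/i);
```

Because the pattern is captured in a closure, the same factory can produce any number of independent pattern scorers.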

levenshtein

Normalized edit distance similarity:
import { levenshtein } from '@deepagents/evals/scorers';

const result = await levenshtein({
  input: 'Spell "hello"',
  output: 'helo',
  expected: 'hello',
});
// { score: 0.8 } (80% similar)
Returns:
  • 1.0 for exact match
  • 0.0 for completely different strings
  • Decimal between 0 and 1 for partial similarity
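The normalization above amounts to score = 1 - distance / max(len(output), len(expected)). A minimal self-contained sketch of that computation (not the library's actual implementation):

```typescript
// Classic dynamic-programming edit distance: dp[i][j] holds the minimum
// number of insertions, deletions, and substitutions to turn a[0..i) into b[0..j).
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                    // deletion
        dp[i][j - 1] + 1,                                    // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalized similarity: 1 - distance / longer length.
// "helo" vs "hello" has distance 1 over max length 5, hence 0.8.
function similarity(a: string, b: string): number {
  if (a === b) return 1;
  return 1 - editDistance(a, b) / Math.max(a.length, b.length);
}
```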

jsonMatch

Deep structural equality for JSON objects:
import { jsonMatch } from '@deepagents/evals/scorers';

const result = await jsonMatch({
  input: 'Generate JSON',
  output: '{"name":"Alice","age":30}',
  expected: { name: 'Alice', age: 30 },
});
// { score: 1.0 }
Returns:
  • 1.0 if JSON structures are deeply equal (order-independent for objects)
  • 0.0 if structures differ or JSON is invalid
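The comparison can be pictured as: parse the output, then walk both values recursively, comparing arrays by position and objects by key set regardless of key order. A minimal sketch of that idea (illustrative only, not the library's implementation):

```typescript
// Deep, key-order-independent structural equality.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (Array.isArray(a) && Array.isArray(b)) {
    // Arrays compare element-by-element: order matters here.
    return a.length === b.length && a.every((v, i) => deepEqual(v, b[i]));
  }
  if (a && b && typeof a === 'object' && typeof b === 'object') {
    // Objects compare by key set: insertion order is irrelevant.
    const ka = Object.keys(a);
    const kb = Object.keys(b);
    return (
      ka.length === kb.length &&
      ka.every((k) => deepEqual((a as any)[k], (b as any)[k]))
    );
  }
  return false;
}

// Parse failures score 0, mirroring the "invalid JSON" rule above.
function jsonMatchSketch(output: string, expected: unknown): number {
  try {
    return deepEqual(JSON.parse(output), expected) ? 1 : 0;
  } catch {
    return 0;
  }
}
```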

LLM-Based Scorers

These scorers use LLMs to evaluate output quality.

factuality(config)

Checks if the output is factually correct given the expected value:
import { factuality } from '@deepagents/evals/scorers';

const factScorer = factuality({ model: 'gpt-4o-mini' });

const result = await factScorer({
  input: 'What is the capital of France?',
  output: 'Paris is the capital and largest city of France.',
  expected: 'Paris',
});
// { score: 1.0, reason: 'Output is factually correct' }
Config:
{
  model: string; // OpenAI-compatible model ID
}
Returns:
  • 1.0 if output is factually consistent with expected
  • 0.0 if output contradicts expected
  • Decimal between 0 and 1 for partial correctness
  • reason field contains LLM’s explanation
Note: the factuality scorer uses the autoevals library and requires an OPENAI_API_KEY environment variable.

Combinators

Combinators compose multiple scorers into one.

all(...scorers)

Weakest-link (minimum score) — all scorers must pass:
import { all, exactMatch, includes } from '@deepagents/evals/scorers';

const strict = all(exactMatch, includes);

const result = await strict({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0 } (both scorers passed)
Returns:
  • The minimum score of all scorers
  • Concatenated reasons from all scorers
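The combining step can be sketched as: take the minimum score and join whatever reasons are present. A simplified synchronous sketch over already-computed results (the real combinator awaits each scorer first):

```typescript
interface Result {
  score: number;
  reason?: string;
}

// Weakest-link combination: the minimum score wins, and all available
// reasons are concatenated so a failing scorer's explanation survives.
function combineAll(results: Result[]): Result {
  const score = Math.min(...results.map((r) => r.score));
  const reason = results
    .map((r) => r.reason)
    .filter((r): r is string => Boolean(r))
    .join('; ');
  return reason ? { score, reason } : { score };
}
```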

any(...scorers)

Best-of (maximum score) — at least one scorer must pass:
import { any, exactMatch, includes } from '@deepagents/evals/scorers';

const lenient = any(exactMatch, includes);

const result = await lenient({
  input: 'What is the capital of France?',
  output: 'The capital is Paris.',
  expected: 'Paris',
});
// { score: 1.0 } (includes passed)
Returns:
  • The maximum score of all scorers
  • Reason from the highest-scoring scorer
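The mirror-image logic, sketched the same way: keep whichever result scored highest, reason included. Again a simplified synchronous sketch over already-computed results, not the library's implementation:

```typescript
interface Result {
  score: number;
  reason?: string;
}

// Best-of combination: the highest-scoring result wins outright,
// carrying its own reason (if any) along with it.
function combineAny(results: Result[]): Result {
  return results.reduce((best, r) => (r.score > best.score ? r : best));
}
```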

weighted(config)

Weighted average — combine scorers with different weights:
import { weighted, exactMatch, factuality } from '@deepagents/evals/scorers';

const balanced = weighted({
  accuracy: { scorer: exactMatch, weight: 2 },
  grounding: { scorer: factuality({ model: 'gpt-4o-mini' }), weight: 1 },
});

const result = await balanced({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0, reason: 'accuracy: 1.00 (w=2), grounding: 1.00 (w=1)' }
Config:
{
  [name: string]: {
    scorer: Scorer;
    weight: number;
  }
}
Returns:
  • Weighted average: sum(score * weight) / sum(weight)
  • Reason lists all scorer scores and weights
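The formula above, sum(score * weight) / sum(weight), is a plain weighted mean. A minimal sketch of just that arithmetic, over already-computed scores:

```typescript
// Weighted mean of scorer results: sum(score * weight) / sum(weight).
// With the config above (accuracy weight 2, grounding weight 1), an
// accuracy score of 1.0 and grounding of 0.5 would yield (2 + 0.5) / 3.
function weightedAverage(parts: { score: number; weight: number }[]): number {
  const totalWeight = parts.reduce((sum, p) => sum + p.weight, 0);
  return parts.reduce((sum, p) => sum + p.score * p.weight, 0) / totalWeight;
}
```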

Custom Scorers

You can create custom scorers by implementing the Scorer type:
import type { Scorer } from '@deepagents/evals/scorers';

const myScorer: Scorer = async ({ input, output, expected }) => {
  // Your scoring logic here
  const score = output.length > 10 ? 1.0 : 0.5;
  return { score, reason: `Output length: ${output.length}` };
};
Type signature:
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

interface ScorerArgs {
  input: unknown;
  output: string;
  expected?: unknown;
}

interface ScorerResult {
  score: number;  // Must be 0..1
  reason?: string;
  metadata?: Record<string, unknown>;
}
Scorer scores must be between 0 and 1. Out-of-range scores will be clamped and logged as warnings.
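Since out-of-range scores are clamped and logged, it is cleaner to clamp inside the scorer yourself. A hypothetical example (the length-based heuristic is purely illustrative, not part of the library):

```typescript
// Clamp a number into the required 0..1 range.
const clamp01 = (n: number) => Math.min(1, Math.max(0, n));

// A custom scorer that clamps explicitly instead of relying on the
// framework's clamping-and-warning behavior. The raw score here
// (length / 100) is an arbitrary illustrative heuristic that could
// exceed 1 for long outputs.
const lengthScorer = async ({ output }: { output: string }) => {
  const raw = output.length / 100;
  return { score: clamp01(raw), reason: `Output length: ${output.length}` };
};
```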

Usage in Evaluation

Scorers are passed to evaluate() as a record:
import { evaluate, exactMatch, includes } from '@deepagents/evals';

await evaluate({
  // ...
  scorers: {
    exact: exactMatch,
    contains: includes,
    custom: myScorer,
  },
});
Each scorer runs independently. A case passes if all scorers return >= threshold.
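That pass rule can be sketched as a single check over the per-scorer scores (simplified over already-computed results; the scorer names here are just the keys from the example above):

```typescript
// A case passes only if every scorer's score meets the threshold.
function casePasses(scores: Record<string, number>, threshold: number): boolean {
  return Object.values(scores).every((s) => s >= threshold);
}
```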

Scorer Comparison

| Scorer      | Speed       | LLM Required | Use Case                |
| ----------- | ----------- | ------------ | ----------------------- |
| exactMatch  | ⚡️ Instant  | No           | Exact string matching   |
| includes    | ⚡️ Instant  | No           | Substring presence      |
| regex       | ⚡️ Instant  | No           | Pattern matching        |
| levenshtein | ⚡️ Fast     | No           | Fuzzy string similarity |
| jsonMatch   | ⚡️ Fast     | No           | JSON structure equality |
| factuality  | 🐢 Slow     | Yes          | Semantic correctness    |

Next Steps

  • API Reference — Full scorer API documentation
  • Comparison — Compare runs with scorer deltas