# Scorers

Scorers evaluate the quality of LLM outputs. All scorers return a `ScorerResult`:

```typescript
interface ScorerResult {
  score: number; // 0..1 (0 = worst, 1 = best)
  reason?: string;
  metadata?: Record<string, unknown>;
}
```
## Deterministic Scorers

These scorers use rule-based logic and don't require LLM calls.
### exactMatch

Strict string equality between output and expected:

```typescript
import { exactMatch } from '@deepagents/evals/scorers';

const result = await exactMatch({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0 }
```

Returns:

- `1.0` if output exactly matches expected
- `0.0` otherwise, with a `reason` explaining the mismatch
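For intuition, a deterministic scorer of this kind can be sketched in a few lines. This is an illustration of the described behavior, not the library's source; `exactMatchSketch` is a hypothetical name:

```typescript
// Sketch: a minimal exact-match scorer (illustrative, not the library's code).
type ScorerResult = { score: number; reason?: string };

async function exactMatchSketch(args: {
  input: unknown;
  output: string;
  expected?: unknown;
}): Promise<ScorerResult> {
  if (args.output === args.expected) {
    return { score: 1.0 };
  }
  // On mismatch, explain what was expected versus what was produced.
  return {
    score: 0.0,
    reason: `Expected ${JSON.stringify(args.expected)}, got ${JSON.stringify(args.output)}`,
  };
}
```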
### includes

Substring check: passes if output contains expected:

```typescript
import { includes } from '@deepagents/evals/scorers';

const result = await includes({
  input: 'What is the capital of France?',
  output: 'The capital of France is Paris.',
  expected: 'Paris',
});
// { score: 1.0 }
```

Returns:

- `1.0` if output includes expected as a substring
- `0.0` otherwise
### regex(pattern)

Regular expression test:

```typescript
import { regex } from '@deepagents/evals/scorers';

const emailScorer = regex(/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$/i);

const result = await emailScorer({
  input: 'Extract the email',
  output: 'user@example.com',
});
// { score: 1.0 }
```

Returns:

- `1.0` if the pattern matches
- `0.0` otherwise
### levenshtein

Normalized edit distance similarity:

```typescript
import { levenshtein } from '@deepagents/evals/scorers';

const result = await levenshtein({
  input: 'Spell "hello"',
  output: 'helo',
  expected: 'hello',
});
// { score: 0.8 } (80% similar)
```

Returns:

- `1.0` for an exact match
- `0.0` for completely different strings
- A decimal between 0 and 1 for partial similarity
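The example score of 0.8 follows from normalizing the edit distance by the longer string's length. A minimal sketch of that arithmetic, assuming the score is computed as `1 - distance / maxLength` (the library may normalize differently):

```typescript
// Sketch: normalized Levenshtein similarity, assuming score = 1 - dist / maxLen.
function levenshteinDistance(a: string, b: string): number {
  // Single-row dynamic programming over edit operations.
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution (free on match)
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

function similarity(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1; // two empty strings are identical
  return 1 - levenshteinDistance(output, expected) / maxLen;
}

console.log(similarity('helo', 'hello')); // 0.8
```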
### jsonMatch

Deep structural equality for JSON objects:

```typescript
import { jsonMatch } from '@deepagents/evals/scorers';

const result = await jsonMatch({
  input: 'Generate JSON',
  output: '{"name":"Alice","age":30}',
  expected: { name: 'Alice', age: 30 },
});
// { score: 1.0 }
```

Returns:

- `1.0` if JSON structures are deeply equal (order-independent for objects)
- `0.0` if structures differ or the JSON is invalid
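The order-independent comparison can be sketched as follows. This is an illustration of the described semantics (parse the output string, then compare structurally), not the library's implementation:

```typescript
// Sketch: order-independent deep equality, as jsonMatch is described to behave.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((v, i) => deepEqual(v, b[i]));
  }
  if (a && b && typeof a === 'object' && typeof b === 'object') {
    const ka = Object.keys(a).sort();
    const kb = Object.keys(b).sort();
    return (
      ka.length === kb.length &&
      ka.every(
        (k, i) =>
          k === kb[i] &&
          deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
      )
    );
  }
  return false;
}

function jsonMatchScore(output: string, expected: unknown): number {
  try {
    return deepEqual(JSON.parse(output), expected) ? 1 : 0;
  } catch {
    return 0; // invalid JSON in the output
  }
}

console.log(jsonMatchScore('{"age":30,"name":"Alice"}', { name: 'Alice', age: 30 })); // 1
```

Note that key order in the output string does not matter, but array element order does.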
## LLM-Based Scorers

These scorers use LLMs to evaluate output quality.

### factuality(config)

Checks whether the output is factually correct given the expected value:

```typescript
import { factuality } from '@deepagents/evals/scorers';

const factScorer = factuality({ model: 'gpt-4o-mini' });

const result = await factScorer({
  input: 'What is the capital of France?',
  output: 'Paris is the capital and largest city of France.',
  expected: 'Paris',
});
// { score: 1.0, reason: 'Output is factually correct' }
```

Config:

```typescript
{
  model: string; // OpenAI-compatible model ID
}
```

Returns:

- `1.0` if the output is factually consistent with expected
- `0.0` if the output contradicts expected
- A decimal between 0 and 1 for partial correctness
- The `reason` field contains the LLM's explanation

The factuality scorer uses the autoevals library and requires an `OPENAI_API_KEY` environment variable.
## Combinators

Combinators compose multiple scorers into one.

### all(...scorers)

Weakest link (minimum score): every scorer must pass:

```typescript
import { all, exactMatch, includes } from '@deepagents/evals/scorers';

const strict = all(exactMatch, includes);

const result = await strict({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0 } (both scorers passed)
```

Returns:

- The minimum score across all scorers
- Concatenated reasons from all scorers
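The minimum-score semantics can be sketched as below. Scorers are shown as synchronous functions for brevity (the real API is async), and this is an illustration of the described behavior, not the library's code:

```typescript
// Sketch: "all" takes the weakest (minimum) score across its scorers.
type Result = { score: number; reason?: string };
type SyncScorer = (args: { output: string; expected?: unknown }) => Result;

function allSketch(...scorers: SyncScorer[]): SyncScorer {
  return (args) => {
    const results = scorers.map((s) => s(args));
    const score = Math.min(...results.map((r) => r.score));
    // Concatenate all reasons so a failure explains which check lost.
    const reason = results.map((r) => r.reason).filter(Boolean).join('; ');
    return { score, reason: reason || undefined };
  };
}

const exact: SyncScorer = ({ output, expected }) =>
  output === expected ? { score: 1 } : { score: 0, reason: 'not an exact match' };
const contains: SyncScorer = ({ output, expected }) =>
  output.includes(String(expected)) ? { score: 1 } : { score: 0, reason: 'substring missing' };

console.log(allSketch(exact, contains)({ output: 'Paris.', expected: 'Paris' }).score); // 0
```

In the usage example, `'Paris.'` contains `'Paris'` but is not exactly equal to it, so the combined score is the minimum, 0.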
### any(...scorers)

Best of (maximum score): at least one scorer must pass:

```typescript
import { any, exactMatch, includes } from '@deepagents/evals/scorers';

const lenient = any(exactMatch, includes);

const result = await lenient({
  input: 'What is the capital of France?',
  output: 'The capital is Paris.',
  expected: 'Paris',
});
// { score: 1.0 } (includes passed)
```

Returns:

- The maximum score across all scorers
- The reason from the highest-scoring scorer
### weighted(config)

Weighted average: combine scorers with different weights:

```typescript
import { weighted, exactMatch, factuality } from '@deepagents/evals/scorers';

const balanced = weighted({
  accuracy: { scorer: exactMatch, weight: 2 },
  grounding: { scorer: factuality({ model: 'gpt-4o-mini' }), weight: 1 },
});

const result = await balanced({
  input: 'What is 2+2?',
  output: '4',
  expected: '4',
});
// { score: 1.0, reason: 'accuracy: 1.00 (w=2), grounding: 1.00 (w=1)' }
```

Config:

```typescript
{
  [name: string]: {
    scorer: Scorer;
    weight: number;
  }
}
```

Returns:

- Weighted average: `sum(score * weight) / sum(weight)`
- The `reason` lists every scorer's score and weight
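The averaging formula above can be worked through with illustrative numbers (a grounding score of 0.4 is invented for the example):

```typescript
// Worked example of sum(score * weight) / sum(weight) with illustrative scores.
const results = [
  { name: 'accuracy', score: 1.0, weight: 2 },
  { name: 'grounding', score: 0.4, weight: 1 },
];
const total = results.reduce((sum, r) => sum + r.score * r.weight, 0);
const weightSum = results.reduce((sum, r) => sum + r.weight, 0);
const score = total / weightSum; // (1.0 * 2 + 0.4 * 1) / 3 ≈ 0.8
```

Doubling the accuracy weight pulls the combined score toward the exact-match result, which is the point of weighting.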
## Custom Scorers

You can create custom scorers by implementing the `Scorer` type:

```typescript
import type { Scorer } from '@deepagents/evals/scorers';

const myScorer: Scorer = async ({ input, output, expected }) => {
  // Your scoring logic here
  const score = output.length > 10 ? 1.0 : 0.5;
  return { score, reason: `Output length: ${output.length}` };
};
```

Type signature:

```typescript
type Scorer = (args: ScorerArgs) => Promise<ScorerResult>;

interface ScorerArgs {
  input: unknown;
  output: string;
  expected?: unknown;
}

interface ScorerResult {
  score: number; // Must be 0..1
  reason?: string;
  metadata?: Record<string, unknown>;
}
```

Scorer scores must be between 0 and 1. Out-of-range scores will be clamped and logged as warnings.
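The clamping described above amounts to the following. A minimal sketch of the assumed behavior, not the library's code:

```typescript
// Sketch: clamp an out-of-range score into [0, 1] and warn (assumed behavior).
function clampScore(score: number): number {
  if (score < 0 || score > 1) {
    console.warn(`Score ${score} is outside [0, 1]; clamping.`);
  }
  return Math.min(1, Math.max(0, score));
}
```

For example, `clampScore(1.5)` warns and returns `1`, while in-range scores pass through unchanged.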
## Usage in Evaluation

Scorers are passed to `evaluate()` as a record:

```typescript
import { evaluate, exactMatch, includes } from '@deepagents/evals';

await evaluate({
  // ...
  scorers: {
    exact: exactMatch,
    contains: includes,
    custom: myScorer,
  },
});
```

Each scorer runs independently. A case passes only if every scorer returns a score at or above the threshold.
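The pass condition can be sketched as follows; the threshold value and per-scorer scores here are invented for illustration:

```typescript
// Sketch: a case passes only if every scorer meets the threshold (assumed logic).
const threshold = 0.8;
const scores: Record<string, number> = { exact: 1.0, contains: 1.0, custom: 0.5 };
const passed = Object.values(scores).every((s) => s >= threshold);
console.log(passed); // false: "custom" is below the threshold
```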
## Scorer Comparison

| Scorer | Speed | LLM Required | Use Case |
| --- | --- | --- | --- |
| `exactMatch` | ⚡️ Instant | No | Exact string matching |
| `includes` | ⚡️ Instant | No | Substring presence |
| `regex` | ⚡️ Instant | No | Pattern matching |
| `levenshtein` | ⚡️ Fast | No | Fuzzy string similarity |
| `jsonMatch` | ⚡️ Fast | No | JSON structure equality |
| `factuality` | 🐢 Slow | Yes | Semantic correctness |
## Next Steps

- **API Reference**: full scorer API documentation
- **Comparison**: compare runs with scorer deltas