# Quickstart

This guide walks you through running your first LLM evaluation with `@deepagents/evals`.
## Basic Example

Here’s a complete evaluation that tests a simple question-answering task:
```ts
import { dataset, evaluate, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';

// Define your task function
const task = async (item: { input: string; expected: string }) => {
  // Call your LLM here
  const response = await callMyLLM(item.input);
  return { output: response };
};

// Run the evaluation
const summary = await evaluate({
  name: 'my-first-eval',
  model: 'gpt-4o',
  dataset: dataset([
    { input: 'What is 2+2?', expected: '4' },
    { input: 'What is 3+3?', expected: '6' },
  ]),
  task,
  scorers: { exact: exactMatch },
  reporters: [consoleReporter()],
  store: new RunStore(),
});

console.log(summary);
```
## Step-by-Step Breakdown

### 1. Import Dependencies

```ts
import { dataset, evaluate, exactMatch } from '@deepagents/evals';
import { consoleReporter } from '@deepagents/evals/reporters';
import { RunStore } from '@deepagents/evals/store';
```
### 2. Create a Dataset

The dataset is an array of test cases. Each case should have an `input` and an `expected` value:

```ts
const ds = dataset([
  { input: 'What is 2+2?', expected: '4' },
  { input: 'What is 3+3?', expected: '6' },
  { input: 'What is the capital of France?', expected: 'Paris' },
]);
```
You can also load from files:

```ts
const ds = dataset('./questions.json');
```
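The exact file schema the loader expects isn't shown here; assuming it mirrors the inline cases above, a `questions.json` might look like:

```json
[
  { "input": "What is 2+2?", "expected": "4" },
  { "input": "What is the capital of France?", "expected": "Paris" }
]
```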
### 3. Define a Task Function

The task function calls your LLM and returns the output:

```ts
import OpenAI from 'openai';

const openai = new OpenAI();

// Example using OpenAI
const task = async (item: { input: string; expected: string }) => {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: item.input }],
  });

  return {
    output: completion.choices[0]!.message.content!,
    usage: {
      inputTokens: completion.usage!.prompt_tokens,
      outputTokens: completion.usage!.completion_tokens,
    },
  };
};
```
The task function should return `{ output: string, usage?: { inputTokens: number, outputTokens: number } }`.
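A quick way to check that shape before wiring up a real model is a stub task that returns canned answers without any API calls (the `canned` lookup below is a made-up example, not part of the library):

```ts
// Offline stub task for smoke-testing the harness: no LLM call,
// just canned answers keyed by prompt. Unknown prompts return a
// placeholder, which exactMatch will score as a fail.
const dryRunTask = async (item: { input: string; expected: string }) => {
  const canned: Record<string, string> = {
    'What is 2+2?': '4',
    'What is 3+3?': '6',
  };
  return { output: canned[item.input] ?? '(no answer)' };
};
```

This lets you verify the dataset, scorers, and reporters end to end before spending tokens.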
### 4. Choose Scorers

Scorers evaluate the quality of the output. Use built-in scorers or create custom ones:

```ts
import { exactMatch, includes } from '@deepagents/evals/scorers';

const scorers = {
  exact: exactMatch,
  contains: includes,
};
```
See the Scorers guide for all available options.
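As a sketch of a custom scorer: assuming a scorer is a function from `{ output, expected }` to a score between 0 and 1 (the exact signature is documented in the Scorers guide), a case-insensitive match might look like:

```ts
// Hypothetical custom scorer: 1 if output equals expected after
// lowercasing and trimming whitespace, else 0. The ({ output, expected })
// => number signature is an assumption; check the Scorers guide.
const caseInsensitiveMatch = (args: { output: string; expected: string }): number =>
  args.output.trim().toLowerCase() === args.expected.trim().toLowerCase() ? 1 : 0;

// Use it alongside the built-ins:
// scorers: { exact: exactMatch, caseless: caseInsensitiveMatch }
```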
### 5. Set Up Reporters

Reporters display evaluation results:

```ts
import { consoleReporter } from '@deepagents/evals/reporters';

const reporters = [
  consoleReporter({ verbosity: 'normal' }),
];
```
### 6. Create a Run Store

The run store persists results to SQLite:

```ts
import { RunStore } from '@deepagents/evals/store';

const store = new RunStore('.evals/store.db');
```
### 7. Run the Evaluation

Put it all together with `evaluate()`:

```ts
const summary = await evaluate({
  name: 'my-first-eval',
  model: 'gpt-4o',
  dataset: ds,
  task,
  scorers,
  reporters,
  store,
});

console.log(summary);
```
## Understanding the Output

The console reporter displays a progress bar and summary:

```text
── my-first-eval ──────────────────────────────────────────
Running 3 cases...

Summary
────────────────────────────────────────────────────────────
Eval:       my-first-eval
Model:      gpt-4o
Threshold:  0.5
Cases:      3
Pass/Fail:  2 / 1 (66.7%)
Duration:   1.2s
Tokens:     In: 45  Out: 12  Total: 57

Scores
  exact: 0.667
────────────────────────────────────────────────────────────
```
The `summary` object contains detailed metrics:

```ts
interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}
```
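For example, you can derive a pass rate from those fields with a small helper (not part of the library, just arithmetic on the interface above):

```ts
// RunSummary shape as documented above, repeated so this snippet
// is self-contained.
interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}

// Fraction of cases that passed; guards against an empty run.
const passRate = (s: RunSummary): number =>
  s.totalCases === 0 ? 0 : s.passCount / s.totalCases;
```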
## Advanced Configuration

### Concurrency

Control how many cases run in parallel:

```ts
await evaluate({
  // ...
  maxConcurrency: 5, // Run 5 cases at a time
});
```
### Timeout

Set a per-case timeout:

```ts
await evaluate({
  // ...
  timeout: 30_000, // 30 seconds per case
});
```
### Threshold

Set the minimum passing score:

```ts
await evaluate({
  // ...
  threshold: 0.8, // Require a score of 0.8 or higher to pass
});
```
### Trials

Run each case multiple times to reduce variance:

```ts
await evaluate({
  // ...
  trials: 3, // Run each case 3 times, average the scores
});
```
## Builder Pattern

The `evaluate()` function returns an `EvalBuilder` that supports chaining:

```ts
// Run only failed cases from the previous run
await evaluate(options).failed();

// Run specific cases
await evaluate(options).cases('0-10,15,20-25');

// Run a random sample
await evaluate(options).sample(50);

// Throw if any cases fail
await evaluate(options).assert();
```
## What’s Next?

- **Datasets**: Learn about dataset loading and transforms
- **Scorers**: Explore all built-in scorers
- **Comparison**: Compare runs and detect regressions
- **API Reference**: Full API documentation