Skip to main content

Persistence

The RunStore class provides SQLite-backed persistence for evaluation runs, cases, and scores.

Creating a Store

import { RunStore } from '@deepagents/evals/store';

// Default location: .evals/store.db
const store = new RunStore();

// Custom location
const store = new RunStore('./my-evals/results.db');

// In-memory (for testing)
import { DatabaseSync } from 'node:sqlite';
const db = new DatabaseSync(':memory:');
const store = new RunStore(db);
The directory is created automatically if it doesn’t exist.

Database Schema

The store manages four main tables:

Suites

A suite groups related runs together (e.g., all runs for a specific evaluation):
interface SuiteRow {
  id: string;
  name: string;
  created_at: number;
}

Runs

A run represents a single evaluation execution:
interface RunRow {
  id: string;
  suite_id: string;
  name: string;
  model: string;
  config: Record<string, unknown> | null;
  started_at: number;
  finished_at: number | null;
  status: 'running' | 'completed' | 'failed';
  summary: RunSummary | null;
}

Cases

A case is a single test item in a run:
interface CaseRow {
  id: string;
  run_id: string;
  idx: number;
  input: unknown;
  output: string | null;
  expected: unknown | null;
  latency_ms: number;
  tokens_in: number;
  tokens_out: number;
  error: string | null;
}

Scores

A score is the result of running a scorer on a case:
interface ScoreRow {
  id: string;
  case_id: string;
  scorer_name: string;
  score: number;
  reason: string | null;
}

Suites

Create a Suite

const suite = store.createSuite('text2sql-accuracy');
// { id: '...', name: 'text2sql-accuracy', created_at: 1234567890 }

Find Suite by Name

const suite = store.findSuiteByName('text2sql-accuracy');
if (suite) {
  console.log(suite.id);
}

Get Suite by ID

const suite = store.getSuite(suiteId);

List All Suites

const suites = store.listSuites();
for (const suite of suites) {
  console.log(suite.name);
}

Rename a Suite

store.renameSuite(suiteId, 'new-name');

Runs

Create a Run

const runId = store.createRun({
  suite_id: suite.id,
  name: 'my-eval',
  model: 'gpt-4o',
  config: { temperature: 0.7 },
});

Finish a Run

Mark a run as completed or failed:
store.finishRun(runId, 'completed', summary);

Get a Run

const run = store.getRun(runId);
if (run) {
  console.log(run.status);
}

List Runs

List all runs or filter by suite:
// All runs
const runs = store.listRuns();

// Runs in a specific suite
const runs = store.listRuns(suiteId);

Get Latest Completed Run

Get the most recent completed run for a suite:
const run = store.getLatestCompletedRun(suiteId);

// Filter by model
const run = store.getLatestCompletedRun(suiteId, 'gpt-4o');

Rename a Run

store.renameRun(runId, 'new-name');

Cases

Save Cases

Save one or more cases:
store.saveCases([
  {
    id: caseId,
    run_id: runId,
    idx: 0,
    input: { question: 'What is 2+2?' },
    output: '4',
    expected: '4',
    latency_ms: 150,
    tokens_in: 10,
    tokens_out: 2,
  },
]);

Get Cases

Get all cases for a run:
const cases = store.getCases(runId);
for (const c of cases) {
  console.log(c.idx, c.output);
}

Get Failing Cases

Get cases that scored below a threshold:
const failing = store.getFailingCases(runId, 0.5);
for (const c of failing) {
  console.log(`Case #${c.idx} failed with scores:`, c.scores);
}
Returns: CaseWithScores[]
interface CaseWithScores extends CaseRow {
  scores: Array<{ scorer_name: string; score: number; reason: string | null }>;
}

Scores

Save Scores

store.saveScores([
  {
    id: scoreId,
    case_id: caseId,
    scorer_name: 'exact',
    score: 1.0,
    reason: undefined,
  },
]);

Summaries

Get Run Summary

Compute aggregated statistics for a run:
const summary = store.getRunSummary(runId, 0.5);
// {
//   totalCases: 100,
//   passCount: 85,
//   failCount: 15,
//   meanScores: { exact: 0.92, factual: 0.88 },
//   totalLatencyMs: 12340,
//   totalTokensIn: 1024,
//   totalTokensOut: 512,
// }
Parameters:
  • runId — Run ID
  • threshold — Minimum score to count as “pass” (default: 0.5)
Returns: RunSummary
interface RunSummary {
  totalCases: number;
  passCount: number;
  failCount: number;
  meanScores: Record<string, number>;
  totalLatencyMs: number;
  totalTokensIn: number;
  totalTokensOut: number;
}

Prompts (Experimental)

The store also supports versioned prompts:

Create a Prompt

const prompt = store.createPrompt('my-prompt', 'You are a helpful assistant.');
// { id: '...', name: 'my-prompt', version: 1, content: '...', created_at: ... }

// Creating again increments version
const prompt2 = store.createPrompt('my-prompt', 'Updated prompt.');
// { id: '...', name: 'my-prompt', version: 2, content: '...', created_at: ... }

List Prompts

const prompts = store.listPrompts();
for (const p of prompts) {
  console.log(`${p.name} v${p.version}`);
}

Get Prompt by ID

const prompt = store.getPrompt(promptId);

Delete a Prompt

store.deletePrompt(promptId);

Transactions

The store uses transactions internally for batch operations. You don’t need to manage transactions manually.

Migrations

The store automatically migrates older database schemas to the latest version:
  • Prompts versioning — Adds version column to prompts table
  • Suite foreign keys — Adds ON DELETE CASCADE to runs.suite_id
Migrations run automatically on store initialization.

Example: Querying Historical Data

import { RunStore } from '@deepagents/evals/store';

const store = new RunStore('.evals/store.db');

const suite = store.findSuiteByName('text2sql-accuracy');
if (!suite) throw new Error('Suite not found');

const runs = store.listRuns(suite.id);
console.log(`Found ${runs.length} runs`);

for (const run of runs) {
  if (run.status === 'completed') {
    const summary = store.getRunSummary(run.id);
    console.log(`${run.model}: ${summary.passCount}/${summary.totalCases} passed`);
  }
}

Next Steps

Comparison

Compare runs and detect regressions

API Reference

Full RunStore API documentation

Build docs developers (and LLMs) love