
Datasets

Datasets in @deepagents/evals are lazy, chainable iterables that load test cases for evaluation.

Loading Datasets

From Arrays

The simplest way to create a dataset is from an inline array:
import { dataset } from '@deepagents/evals/dataset';

const ds = dataset([
  { input: 'What is 2+2?', expected: '4' },
  { input: 'What is 3+3?', expected: '6' },
]);

From JSON Files

Load from a JSON file containing an array:
questions.json
[
  { "input": "What is 2+2?", "expected": "4" },
  { "input": "What is 3+3?", "expected": "6" }
]
const ds = dataset('./questions.json');

From JSONL Files

Load from a JSONL (JSON Lines) file where each line is a JSON object:
questions.jsonl
{"input": "What is 2+2?", "expected": "4"}
{"input": "What is 3+3?", "expected": "6"}
const ds = dataset('./questions.jsonl');
JSONL is recommended for large datasets because it’s memory-efficient and lazy-loaded.

From CSV Files

Load from a CSV file with a header row:
questions.csv
input,expected
"What is 2+2?","4"
"What is 3+3?","6"
const ds = dataset('./questions.csv');
CSV files are parsed row-by-row. Each row becomes a Record<string, string> object.

From AsyncIterable

You can also wrap any AsyncIterable:
async function* loadFromDB() {
  // `db` is assumed to be your database client
  const rows = await db.query('SELECT * FROM test_cases');
  for (const row of rows) {
    yield { input: row.question, expected: row.answer };
  }
}

const ds = dataset(loadFromDB());

Dataset Schema

Each dataset item should have at minimum:
interface DatasetItem {
  input: unknown;     // Passed to the task function
  expected?: unknown; // Passed to scorers for comparison
}
You can include any additional fields:
const ds = dataset([
  { 
    input: 'What is 2+2?', 
    expected: '4',
    difficulty: 'easy',
    category: 'arithmetic',
  },
]);

Transform Methods

Datasets support chainable transforms. These are lazy — transforms are applied during iteration, not when called.
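To illustrate what "lazy" means here, below is a minimal sketch of a map transform over an AsyncIterable (illustrative only, not the library's internals). Calling the transform does no work; the mapping function runs only as a consumer pulls items.

```typescript
// Sketch of a lazy map over an AsyncIterable — not the actual
// @deepagents/evals implementation.
async function* lazyMap<T, U>(
  source: AsyncIterable<T>,
  fn: (item: T) => U,
): AsyncIterable<U> {
  for await (const item of source) {
    yield fn(item); // applied only when a consumer pulls this item
  }
}

async function* numbers() {
  yield 1;
  yield 2;
  yield 3;
}

let calls = 0;
const mapped = lazyMap(numbers(), (n) => {
  calls++;
  return n * 2;
});

// Nothing has run yet — the transform is deferred until iteration.
const callsBeforeIteration = calls;

const results: number[] = [];
for await (const n of mapped) {
  results.push(n);
}
// results is now [2, 4, 6], and fn ran exactly 3 times
```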

map(fn)

Transform each item:
const ds = dataset('./raw-data.json')
  .map((row) => ({
    input: row.question,
    expected: row.answer,
  }));
Lazy: ✅ Items are transformed as they’re consumed.

filter(fn)

Exclude items that don’t match a predicate:
const ds = dataset('./data.json')
  .filter((row) => row.difficulty === 'hard');
Lazy: ✅ Items are filtered as they’re consumed.

limit(n)

Cap the dataset at n items:
const ds = dataset('./large-dataset.jsonl')
  .limit(100); // Only use the first 100 items
Lazy: ✅ Stops iteration after n items.

shuffle()

Randomize the order of items:
const ds = dataset('./data.json')
  .shuffle();
Eager: ⚠️ Buffers all items into memory before shuffling.

sample(n)

Pick n random items:
const ds = dataset('./large-dataset.jsonl')
  .sample(50); // Pick 50 random items
Eager: ⚠️ Buffers all items into memory before sampling.
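Conceptually, an eager sample works by buffering the entire source, shuffling the buffer, and taking the first n items. The sketch below shows that pattern (illustrative only, not the library's code), which is why memory usage scales with the full dataset size rather than with n:

```typescript
// Sketch of an eager sample over an AsyncIterable: buffer everything,
// Fisher–Yates shuffle, then take the first n items. Not the actual
// @deepagents/evals implementation.
async function eagerSample<T>(source: AsyncIterable<T>, n: number): Promise<T[]> {
  const buffer: T[] = [];
  for await (const item of source) {
    buffer.push(item); // the whole dataset is held in memory here
  }
  // Fisher–Yates shuffle of the buffer
  for (let i = buffer.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [buffer[i], buffer[j]] = [buffer[j], buffer[i]];
  }
  return buffer.slice(0, n);
}

async function* source() {
  for (let i = 0; i < 10; i++) yield i;
}

const picked = await eagerSample(source(), 3);
```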

toArray()

Consume the dataset into a plain array:
const items = await dataset('./data.json').toArray();
console.log(items.length);
Eager: ⚠️ Loads all items into memory.

Chaining Transforms

Transforms can be chained together:
const ds = dataset('./large-dataset.jsonl')
  .filter((row) => row.difficulty === 'hard')
  .map((row) => ({ input: row.question, expected: row.answer }))
  .shuffle()
  .limit(100);
Evaluation order:
  1. Load from file (lazy)
  2. Filter by difficulty (lazy)
  3. Map to input/expected (lazy)
  4. Shuffle (eager — buffers all filtered items)
  5. Limit to 100 (lazy)
Putting shuffle() or sample() before limit() buffers the entire dataset. Add a limit() before the shuffle to cap how many items get buffered:
// ❌ Bad: Shuffles all 10M items
dataset('./huge.jsonl').shuffle().limit(100);

// ✅ Good: Shuffles only first 1000 items
dataset('./huge.jsonl').limit(1000).shuffle().limit(100);

Hugging Face Datasets

You can also load datasets from Hugging Face:
import { hf } from '@deepagents/evals/dataset';

const ds = hf({
  repo: 'rajpurkar/squad',
  split: 'validation',
  limit: 100,
});
See the dataset API reference for more details.

Record Selection

You can select specific cases by index using the pick() method:
import { parseRecordSelection } from '@deepagents/evals/dataset';

const { indexes } = parseRecordSelection('0-10,15,20-25');
const ds = dataset('./data.json').pick(indexes);
Supported formats:
  • 0-10 — Range from 0 to 10 (inclusive)
  • 5 — Single index
  • 0-10,15,20-25 — Multiple ranges and indexes
This is used internally by evaluate().cases('0-10,15').
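For illustration, a selection string can be parsed by splitting on commas, then expanding each inclusive range. This is a sketch of the idea only; the real parseRecordSelection may differ in validation, ordering, or return shape:

```typescript
// Sketch of parsing a record-selection string ('0-3,15,20-22') into a
// sorted list of unique indexes. Illustrative only — not the actual
// parseRecordSelection from @deepagents/evals.
function parseSelection(spec: string): number[] {
  const indexes = new Set<number>();
  for (const part of spec.split(',')) {
    const [start, end] = part.split('-').map(Number);
    if (end === undefined) {
      indexes.add(start); // single index, e.g. '15'
    } else {
      for (let i = start; i <= end; i++) indexes.add(i); // inclusive range
    }
  }
  return [...indexes].sort((a, b) => a - b);
}

const indexes = parseSelection('0-3,15,20-22');
// → [0, 1, 2, 3, 15, 20, 21, 22]
```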

Performance Tips

Use JSONL for Large Datasets

JSONL is streamed line-by-line, so it’s memory-efficient:
// ✅ Good: Lazy-loads 1M items
const ds = dataset('./1M-items.jsonl').limit(100);

// ❌ Bad: Loads entire 1M-item array into memory
const ds = dataset('./1M-items.json').limit(100);
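To see why JSONL streams well, here is a sketch of a lazy line-by-line reader using Node's readline module (illustrative; the library's own loader may work differently). Each line is parsed only as it is consumed, so memory stays flat regardless of file size:

```typescript
import { createReadStream, writeFileSync, mkdtempSync } from 'node:fs';
import { createInterface } from 'node:readline';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Sketch of a lazy JSONL reader: yields one parsed object per line as
// the consumer iterates. Not the actual @deepagents/evals loader.
async function* readJsonl(path: string): AsyncIterable<unknown> {
  const rl = createInterface({
    input: createReadStream(path),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (line.trim()) yield JSON.parse(line);
  }
}

// Demo with a small temporary file
const dir = mkdtempSync(join(tmpdir(), 'jsonl-'));
const file = join(dir, 'demo.jsonl');
writeFileSync(
  file,
  '{"input":"What is 2+2?","expected":"4"}\n{"input":"What is 3+3?","expected":"6"}\n',
);

const items: any[] = [];
for await (const item of readJsonl(file)) {
  items.push(item);
}
```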

Filter Early

Put filter() as early as possible to reduce downstream processing:
// ✅ Good: Filter before expensive map
dataset('./data.jsonl')
  .filter((row) => row.difficulty === 'hard')
  .map(expensiveTransform);

// ❌ Bad: Map before filter
dataset('./data.jsonl')
  .map(expensiveTransform)
  .filter((row) => row.difficulty === 'hard');

Avoid Multiple Shuffles

Each shuffle() buffers the dataset into memory:
// ❌ Bad: Buffers twice
dataset('./data.jsonl')
  .shuffle()
  .filter(somePredicate)
  .shuffle(); // Buffers again

// ✅ Good: Shuffle once
dataset('./data.jsonl')
  .filter(somePredicate)
  .shuffle(); // Single buffer

Next Steps

Scorers

Learn about scoring functions

API Reference

Full dataset API documentation
