# Datasets
Datasets in @deepagents/evals are lazy, chainable iterables that load test cases for evaluation.
## Loading Datasets

### From Arrays

The simplest way to create a dataset is from an inline array:
```ts
import { dataset } from '@deepagents/evals/dataset';

const ds = dataset([
  { input: 'What is 2+2?', expected: '4' },
  { input: 'What is 3+3?', expected: '6' },
]);
```
### From JSON Files
Load from a JSON file containing an array:
```json
[
  { "input": "What is 2+2?", "expected": "4" },
  { "input": "What is 3+3?", "expected": "6" }
]
```
```ts
const ds = dataset('./questions.json');
```
### From JSONL Files
Load from a JSONL (JSON Lines) file where each line is a JSON object:
```jsonl
{ "input": "What is 2+2?", "expected": "4" }
{ "input": "What is 3+3?", "expected": "6" }
```

```ts
const ds = dataset('./questions.jsonl');
```
JSONL is recommended for large datasets because it’s memory-efficient and lazy-loaded.
### From CSV Files
Load from a CSV file with a header row:
```csv
input,expected
"What is 2+2?","4"
"What is 3+3?","6"
```

```ts
const ds = dataset('./questions.csv');
```
CSV files are parsed row by row. Each row becomes a `Record<string, string>` object.
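Because every CSV field arrives as a string, a common pattern is to coerce values into the types your task expects inside a `map()` step. A minimal standalone sketch of such a coercion helper (`NumericItem` and `toNumericItem` are hypothetical names, not part of the library):

```ts
// CSV fields always arrive as strings; coerce them into typed values
// before evaluation. NumericItem and toNumericItem are illustrative names.
interface NumericItem {
  input: string;
  expected: number; // parsed from the CSV string
}

function toNumericItem(row: Record<string, string>): NumericItem {
  return {
    input: row['input'] ?? '',
    expected: Number(row['expected']), // '4' -> 4
  };
}
```

With the library, you would apply it as a transform, e.g. `dataset('./questions.csv').map(toNumericItem)`.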
### From AsyncIterable
You can also wrap any AsyncIterable:
```ts
async function* loadFromDB() {
  const rows = await db.query('SELECT * FROM test_cases');
  for (const row of rows) {
    yield { input: row.question, expected: row.answer };
  }
}

const ds = dataset(loadFromDB());
```
## Dataset Schema
Each dataset item should have at minimum:
```ts
interface DatasetItem {
  input: unknown;     // Passed to the task function
  expected?: unknown; // Passed to scorers for comparison
}
```
You can include any additional fields:
```ts
const ds = dataset([
  {
    input: 'What is 2+2?',
    expected: '4',
    difficulty: 'easy',
    category: 'arithmetic',
  },
]);
```
## Transforms

Datasets support chainable transforms. These are lazy — transforms are applied during iteration, not when called.
### map(fn)
Transform each item:
```ts
const ds = dataset('./raw-data.json')
  .map((row) => ({
    input: row.question,
    expected: row.answer,
  }));
```
**Lazy:** ✅ Items are transformed as they're consumed.
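The laziness here is ordinary async-generator behavior. This self-contained sketch (illustrating the concept, not the library's actual internals) shows that a mapped source only does work when the consumer pulls items:

```ts
// Standalone sketch of a lazy map over an AsyncIterable: the transform
// runs only when the consumer pulls the next item, mirroring dataset().map().
async function* lazyMap<T, U>(
  source: AsyncIterable<T>,
  fn: (item: T) => U,
): AsyncGenerator<U> {
  for await (const item of source) {
    yield fn(item); // fn runs here, on demand
  }
}

async function* numbers(): AsyncGenerator<number> {
  yield 1;
  yield 2;
  yield 3;
}

// Stop after two items; the third is never transformed.
async function firstTwoTimesTen(): Promise<number[]> {
  const out: number[] = [];
  for await (const n of lazyMap(numbers(), (x) => x * 10)) {
    out.push(n);
    if (out.length === 2) break;
  }
  return out;
}
```

Breaking out of the `for await` loop releases the source without ever evaluating the remaining items, which is what makes chained lazy transforms cheap on large datasets.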
### filter(fn)
Exclude items that don’t match a predicate:
```ts
const ds = dataset('./data.json')
  .filter((row) => row.difficulty === 'hard');
```
**Lazy:** ✅ Items are filtered as they're consumed.
### limit(n)
Cap the dataset at `n` items:
```ts
const ds = dataset('./large-dataset.jsonl')
  .limit(100); // Only use the first 100 items
```
**Lazy:** ✅ Stops iteration after `n` items.
### shuffle()
Randomize the order of items:
```ts
const ds = dataset('./data.json')
  .shuffle();
```
**Eager:** ⚠️ Buffers all items into memory before shuffling.
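To see why shuffling an async stream must be eager, consider a minimal standalone sketch (not the library's actual code): a uniform shuffle needs random access to every item, so the whole stream has to be buffered first. The sketch below uses a Fisher–Yates shuffle:

```ts
// Sketch of why shuffle() is eager: the entire stream is buffered before
// a Fisher-Yates shuffle can run. Illustrative, not the library's code.
async function shuffleAll<T>(source: AsyncIterable<T>): Promise<T[]> {
  const buffer: T[] = [];
  for await (const item of source) {
    buffer.push(item); // every item is held in memory
  }
  // Fisher-Yates: swap each position with a random earlier-or-equal index
  for (let i = buffer.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [buffer[i], buffer[j]] = [buffer[j], buffer[i]];
  }
  return buffer;
}
```

This is why the ordering advice below matters: anything placed before a shuffle determines how many items end up in that buffer.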
### sample(n)
Pick n random items:
```ts
const ds = dataset('./large-dataset.jsonl')
  .sample(50); // Pick 50 random items
```
**Eager:** ⚠️ Buffers all items into memory before sampling.
### toArray()
Consume the dataset into a plain array:
```ts
const items = await dataset('./data.json').toArray();
console.log(items.length);
```
**Eager:** ⚠️ Loads all items into memory.
### Chaining

Transforms can be chained together:
```ts
const ds = dataset('./large-dataset.jsonl')
  .filter((row) => row.difficulty === 'hard')
  .map((row) => ({ input: row.question, expected: row.answer }))
  .shuffle()
  .limit(100);
```
Evaluation order:

1. Load from file (lazy)
2. Filter by difficulty (lazy)
3. Map to input/expected (lazy)
4. Shuffle (eager — buffers all filtered items)
5. Limit to 100 (lazy)
Putting `shuffle()` or `sample()` before `limit()` will buffer the entire dataset. Put `limit()` first to reduce memory usage:

```ts
// ❌ Bad: shuffles all 10M items
dataset('./huge.jsonl').shuffle().limit(100);

// ✅ Good: shuffles only the first 1000 items
dataset('./huge.jsonl').limit(1000).shuffle().limit(100);
```
## Hugging Face Datasets
You can also load datasets from Hugging Face:
```ts
import { hf } from '@deepagents/evals/dataset';

const ds = hf({
  repo: 'rajpurkar/squad',
  split: 'validation',
  limit: 100,
});
```
See the dataset API reference for more details.
## Record Selection
You can select specific cases by index using the `pick()` method:
```ts
import { parseRecordSelection } from '@deepagents/evals/dataset';

const { indexes } = parseRecordSelection('0-10,15,20-25');
const ds = dataset('./data.json').pick(indexes);
```
Supported formats:

- `0-10` — Range from 0 to 10 (inclusive)
- `5` — Single index
- `0-10,15,20-25` — Multiple ranges and indexes
This is used internally by `evaluate().cases('0-10,15')`.
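The formats above suggest what the parsing looks like. Here is a self-contained sketch of one way to expand such a selection string into a flat index list (`parseSelection` is an illustrative name, not the actual `parseRecordSelection` implementation):

```ts
// Illustrative sketch of expanding a selection string like '0-10,15,20-25'
// into a flat index list. Not the actual parseRecordSelection code.
function parseSelection(spec: string): number[] {
  const indexes: number[] = [];
  for (const part of spec.split(',')) {
    const bounds = part.split('-');
    const start = Number(bounds[0]);
    const end = bounds.length > 1 ? Number(bounds[1]) : start; // single index
    for (let i = start; i <= end; i++) {
      indexes.push(i); // ranges are inclusive on both ends
    }
  }
  return indexes;
}
```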
## Best Practices

### Use JSONL for Large Datasets
JSONL is streamed line by line, so it's memory-efficient:

```ts
// ✅ Good: lazy-loads from a 1M-item file
const ds = dataset('./1M-items.jsonl').limit(100);

// ❌ Bad: loads the entire 1M-item array into memory
const ds2 = dataset('./1M-items.json').limit(100);
```
### Filter Early
Put `filter()` as early as possible to reduce downstream processing:

```ts
// ✅ Good: filter before an expensive map
dataset('./data.jsonl')
  .filter((row) => row.difficulty === 'hard')
  .map(expensiveTransform);

// ❌ Bad: map before filter
dataset('./data.jsonl')
  .map(expensiveTransform)
  .filter((row) => row.difficulty === 'hard');
```
### Avoid Multiple Shuffles
Each `shuffle()` buffers the dataset into memory:

```ts
// ❌ Bad: buffers twice
dataset('./data.jsonl')
  .shuffle()
  .filter(somePredicate)
  .shuffle(); // Buffers again

// ✅ Good: shuffle once
dataset('./data.jsonl')
  .filter(somePredicate)
  .shuffle(); // Single buffer
```
## Next Steps

- **Scorers**: Learn about scoring functions
- **API Reference**: Full dataset API documentation