Overview
Dataloaders abstract away dataset-specific formats and provide:

- Catalog-based loader creation - Factory pattern for consistent instantiation
- Normalized record format - All datasets produce `DatasetRecord` objects
- Framework conversion - Automatic conversion to Haystack or LangChain documents
- Evaluation queries - Ground-truth QA pairs for retrieval benchmarking
- Streaming support - Memory-efficient iteration over large datasets
Supported datasets
| Dataset | Type | Records | Queries | Description |
|---|---|---|---|---|
| TriviaQA (`triviaqa`) | Open-domain QA | ~500 index | ~100 eval | Trivia questions with evidence documents |
| ARC (`arc`) | Science QA | ~1000 index | ~200 eval | AI2 Reasoning Challenge questions |
| PopQA (`popqa`) | Entity-centric QA | ~500 index | ~100 eval | Entity-focused questions from Wikipedia |
| FActScore (`factscore`) | Factuality QA | ~500 index | ~100 eval | Factuality-focused evaluation dataset |
| Earnings Calls (`earnings_calls`) | Financial QA | ~300 index | ~50 eval | Financial QA from earnings call transcripts |
Architecture
Core components
Class hierarchy
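A plausible sketch of the hierarchy, assuming one loader subclass per supported dataset (the subclass names shown here are illustrative, not confirmed by this page):

```
BaseDatasetLoader          # abstract: defines the loading interface
├── TriviaQALoader         # triviaqa
├── ARCLoader              # arc
├── PopQALoader            # popqa
├── FActScoreLoader        # factscore
└── EarningsCallsLoader    # earnings_calls

DataloaderCatalog          # factory: creates loaders by dataset ID
DocumentConverter          # converts records to Haystack / LangChain documents
```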
Basic usage
Creating a loader
Use the `DataloaderCatalog` factory to create loaders:
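The factory's exact signature is not reproduced on this page; the following is a minimal, self-contained sketch of how a registry-backed `create()` factory typically works. The stub loader class and the `limit` option are illustrative assumptions.

```python
# Sketch of a catalog factory; the real DataloaderCatalog maps dataset
# IDs ("triviaqa", "arc", ...) to loader classes in the same spirit.

class BaseDatasetLoader:
    """Stand-in for the real abstract base class."""
    def __init__(self, limit=None):
        self.limit = limit

class TriviaQALoader(BaseDatasetLoader):
    """Illustrative loader subclass registered under 'triviaqa'."""
    dataset_id = "triviaqa"

class DataloaderCatalog:
    _registry = {"triviaqa": TriviaQALoader}

    @classmethod
    def create(cls, dataset_id, **options):
        # Look up the loader class and instantiate it with the options
        try:
            loader_cls = cls._registry[dataset_id]
        except KeyError:
            raise ValueError(f"Unknown dataset: {dataset_id!r}")
        return loader_cls(**options)

loader = DataloaderCatalog.create("triviaqa", limit=100)
print(type(loader).__name__, loader.limit)  # TriviaQALoader 100
```

A registry keyed by dataset ID keeps instantiation uniform and makes unknown IDs fail loudly rather than silently.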
Loading datasets
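The load-and-iterate pattern can be sketched as follows. The stub loader, the record shape, and the placeholder record contents are illustrative; `DatasetLoadError` is one of the exceptions named later on this page.

```python
# Sketch of loading records; DummyLoader stands in for a loader
# obtained from DataloaderCatalog.create(...).

class DatasetLoadError(Exception):
    """Raised when a dataset cannot be fetched or parsed."""

class DummyLoader:
    def __init__(self, limit=None):
        self.limit = limit

    def load(self):
        # A real loader would stream records from HuggingFace; here we
        # yield placeholder records to show the iteration pattern.
        records = [
            {"text": "placeholder passage 1", "metadata": {"id": "doc-1"}},
            {"text": "placeholder passage 2", "metadata": {"id": "doc-2"}},
        ]
        for i, rec in enumerate(records):
            if self.limit is not None and i >= self.limit:
                break  # honor the development-time limit
            yield rec

loader = DummyLoader(limit=1)
try:
    texts = [rec["text"] for rec in loader.load()]
except DatasetLoadError:
    texts = []  # a real pipeline might retry or fall back here
print(texts)  # ['placeholder passage 1']
```

Because `load()` is a generator, records are produced one at a time, which is what makes the streaming mode described below memory-efficient.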
Converting to framework documents
Data structures
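The three structures described below suggest shapes like the following. This is a hedged sketch using plain dataclasses; the field names beyond those listed on this page (e.g. `relevant_doc_ids`) are assumptions, not the library's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class DatasetRecord:
    """Normalized document: text plus arbitrary metadata."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass
class EvaluationQuery:
    """Ground-truth query for retrieval benchmarking."""
    question: str
    answers: list[str]
    relevant_doc_ids: list[str] = field(default_factory=list)

@dataclass
class LoadedDataset:
    """Wrapper holding loaded records plus dataset-level metadata."""
    records: list[DatasetRecord]
    metadata: dict[str, Any] = field(default_factory=dict)
```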
DatasetRecord
Normalized document record with text and metadata.

EvaluationQuery
Evaluation query with ground-truth answers and relevant document IDs.

LoadedDataset
Wrapper containing loaded records and metadata.

Dataset implementations
TriviaQA
Dataset ID: `triviaqa`
HuggingFace: `trivia_qa`
Structure: Questions with multiple evidence documents

- `text`: Evidence document content
- `metadata.id`: Document identifier
- `metadata.title`: Document title
- `metadata.question`: Associated question
ARC (AI2 Reasoning Challenge)
Dataset ID: `arc`
HuggingFace: `ai2_arc`
Structure: Science questions with multiple-choice answers

- `text`: Question + answer choices
- `metadata.id`: Question identifier
- `metadata.question`: Question text
- `metadata.answerKey`: Correct answer
PopQA
Dataset ID: `popqa`
HuggingFace: `akariasai/PopQA`
Structure: Entity-centric questions from Wikipedia

- `text`: Wikipedia passage
- `metadata.id`: Passage identifier
- `metadata.entity`: Entity mention
- `metadata.question`: Associated question
FActScore
Dataset ID: `factscore`
HuggingFace: `dskar/FActScore`
Structure: Factuality-focused QA pairs

- `text`: Factual statement
- `metadata.id`: Statement identifier
- `metadata.topic`: Topic/category
Earnings Calls
Dataset ID: `earnings_calls`
HuggingFace: `lamini/earnings-calls-qa`
Structure: Financial QA from earnings call transcripts

- `text`: Transcript excerpt
- `metadata.id`: Excerpt identifier
- `metadata.company`: Company name
- `metadata.quarter`: Reporting quarter
Base loader interface
All loaders implement the `BaseDatasetLoader` abstract class:
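The interface itself is not reproduced on this page; a plausible minimal version follows. The method names (`load`, `load_queries`) and constructor parameters are assumptions based on the features described above (streaming, limits, evaluation queries).

```python
from abc import ABC, abstractmethod
from typing import Iterator, Optional

class BaseDatasetLoader(ABC):
    """Sketch of the abstract loader interface (names assumed)."""

    dataset_id: str  # catalog identifier, e.g. "triviaqa"

    def __init__(self, limit: Optional[int] = None, streaming: bool = True):
        self.limit = limit          # cap on records, useful in development
        self.streaming = streaming  # iterate lazily rather than load all

    @abstractmethod
    def load(self) -> Iterator[dict]:
        """Yield normalized records for indexing."""

    @abstractmethod
    def load_queries(self) -> list[dict]:
        """Return ground-truth evaluation queries."""
```

Keeping the interface down to two abstract methods means each dataset-specific loader only has to translate its source format into the normalized record shape.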
Document conversion
The `DocumentConverter` class provides framework-specific conversion.
Haystack conversion
LangChain conversion
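The two conversions above are essentially field renames. Below is a dependency-free sketch of the mapping, returning plain dicts instead of real framework objects: Haystack 2.x documents carry `content`/`meta`, LangChain documents carry `page_content`/`metadata`. The function names are illustrative.

```python
# Field mapping behind the two conversions; real code would construct
# haystack.Document / langchain_core.documents.Document objects.

def to_haystack_dict(record: dict) -> dict:
    return {"content": record["text"], "meta": record["metadata"]}

def to_langchain_dict(record: dict) -> dict:
    return {"page_content": record["text"], "metadata": record["metadata"]}

record = {"text": "example passage", "metadata": {"id": "doc-1"}}
print(to_haystack_dict(record))
# {'content': 'example passage', 'meta': {'id': 'doc-1'}}
```

Since only field names differ, converting once during indexing (as recommended under Best practices) costs little and avoids repeated work at query time.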
Configuration
Dataloaders integrate with YAML configuration.

Streaming mode
By default, loaders use streaming to handle large datasets efficiently.

Custom loaders
Implement custom loaders by extending `BaseDatasetLoader`:
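A self-contained sketch of a custom loader follows. The base-class stub, the method name, and the record shape stand in for the real `BaseDatasetLoader` interface, and the sample rows are placeholders.

```python
from abc import ABC, abstractmethod

class BaseDatasetLoader(ABC):
    """Stub standing in for the real base class."""
    @abstractmethod
    def load(self):
        ...

class MyCorpusLoader(BaseDatasetLoader):
    """Illustrative custom loader for an in-memory corpus."""
    dataset_id = "my_corpus"

    def __init__(self, rows):
        self.rows = rows  # in practice: a file path or HF dataset name

    def load(self):
        for row in self.rows:
            # Normalize source-specific fields into the record shape
            yield {"text": row["body"], "metadata": {"id": row["doc_id"]}}

rows = [{"doc_id": "a1", "body": "first passage"}]
records = list(MyCorpusLoader(rows).load())
print(records[0]["metadata"]["id"])  # a1
```

The only dataset-specific work is the field mapping inside `load()`; everything downstream (conversion, indexing, evaluation) sees the same normalized records.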
Error handling
Dataloaders raise specific exceptions for different error conditions.

Best practices

Use the catalog
Always use `DataloaderCatalog.create()` instead of instantiating loaders directly, for consistency.

Set limits during development
Use the `limit=` parameter during prototyping to avoid loading full datasets.

Convert once
Convert to framework documents once during indexing, not repeatedly during queries.

Handle exceptions
Catch `DatasetLoadError` and `DatasetValidationError` for robust pipelines.