The dataloaders module provides a unified interface for loading benchmark datasets, normalizing their formats, and converting them to framework-specific document objects.

Overview

Dataloaders abstract away dataset-specific formats and provide:
  • Catalog-based loader creation - Factory pattern for consistent instantiation
  • Normalized record format - All datasets produce DatasetRecord objects
  • Framework conversion - Automatic conversion to Haystack or LangChain documents
  • Evaluation queries - Ground-truth QA pairs for retrieval benchmarking
  • Streaming support - Memory-efficient iteration over large datasets
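
A minimal end-to-end sketch using the APIs documented below (the dataset name and limit are illustrative):
from vectordb.dataloaders import DataloaderCatalog
from vectordb.dataloaders.converters import DocumentConverter, records_to_items

# Create a loader from the catalog, load normalized records, convert for the target framework
loader = DataloaderCatalog.create(name="triviaqa", split="test", limit=100)
dataset = loader.load()
items = records_to_items(dataset.records)
haystack_docs = DocumentConverter.to_haystack(items)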

Supported datasets

| Dataset | Type | Index records | Eval queries | Description |
|---|---|---|---|---|
| TriviaQA (triviaqa) | Open-domain QA | ~500 | ~100 | Trivia questions with evidence documents |
| ARC (arc) | Science QA | ~1000 | ~200 | AI2 Reasoning Challenge questions |
| PopQA (popqa) | Entity-centric QA | ~500 | ~100 | Entity-focused questions from Wikipedia |
| FActScore (factscore) | Factuality QA | ~500 | ~100 | Factuality-focused evaluation dataset |
| Earnings Calls (earnings_calls) | Financial QA | ~300 | ~50 | Financial QA from earnings call transcripts |

Architecture

Core components

dataloaders/
├── catalog.py          # Factory for creating loaders
├── base.py             # Abstract base class defining the contract
├── types.py            # Shared types (DatasetRecord, EvaluationQuery)
├── converters.py       # Framework document conversion
├── dataset.py          # LoadedDataset wrapper
├── evaluation.py       # Evaluation query extraction
└── datasets/           # Per-dataset implementations
    ├── triviaqa.py
    ├── arc.py
    ├── popqa.py
    ├── factscore.py
    └── earnings_calls.py

Class hierarchy

BaseDatasetLoader (ABC)
    ├── TriviaQALoader
    ├── ARCLoader
    ├── PopQALoader
    ├── FactScoreLoader
    └── EarningsCallsLoader

Basic usage

Creating a loader

Use the DataloaderCatalog factory to create loaders:
from vectordb.dataloaders import DataloaderCatalog

loader = DataloaderCatalog.create(
    name="triviaqa",
    split="test",
    limit=500
)

Loading datasets

# Load normalized records
dataset = loader.load()

print(f"Loaded {len(dataset.records)} records")
print(f"Dataset type: {dataset.dataset_type}")

Converting to framework documents

from vectordb.dataloaders.converters import DocumentConverter, records_to_items

# Convert records to normalized items
items = records_to_items(dataset.records)

# Convert to Haystack documents
haystack_docs = DocumentConverter.to_haystack(items)

# Convert to LangChain documents
langchain_docs = DocumentConverter.to_langchain(items)

Data structures

DatasetRecord

Normalized document record with text and metadata:
@dataclass(frozen=True, slots=True)
class DatasetRecord:
    text: str                    # Document content to index
    metadata: dict[str, Any]     # Dataset-specific metadata
Example:
DatasetRecord(
    text="The Great Wall of China was built over several centuries...",
    metadata={
        "id": "doc_001",
        "source": "triviaqa",
        "title": "Great Wall of China"
    }
)

EvaluationQuery

Evaluation query with ground-truth answers and relevant document IDs:
@dataclass(frozen=True, slots=True)
class EvaluationQuery:
    query: str                      # User/evaluation question
    answers: list[str]              # Ground-truth answers
    relevant_doc_ids: list[str]     # IDs of known relevant docs
    metadata: dict[str, Any]        # Additional metadata
Example:
EvaluationQuery(
    query="When was the Great Wall of China built?",
    answers=["over several centuries", "7th century BC"],
    relevant_doc_ids=["doc_001", "doc_045"],
    metadata={"difficulty": "easy", "category": "history"}
)

LoadedDataset

Wrapper containing loaded records and metadata:
class LoadedDataset:
    dataset_type: DatasetType      # "triviaqa", "arc", etc.
    records: list[DatasetRecord]   # Normalized documents
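
Downstream code typically reads just these two attributes. A minimal sketch that drops empty records before indexing:
dataset = loader.load()
indexable = [r for r in dataset.records if r.text.strip()]
print(f"{dataset.dataset_type}: {len(indexable)} of {len(dataset.records)} records ready to index")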

Dataset implementations

TriviaQA

Dataset ID: triviaqa
HuggingFace: trivia_qa
Structure: Questions with multiple evidence documents
loader = DataloaderCatalog.create("triviaqa", split="test", limit=500)
dataset = loader.load()
Record format:
  • text: Evidence document content
  • metadata.id: Document identifier
  • metadata.title: Document title
  • metadata.question: Associated question

ARC (AI2 Reasoning Challenge)

Dataset ID: arc
HuggingFace: ai2_arc
Structure: Science questions with multiple-choice answers
loader = DataloaderCatalog.create("arc", split="test", limit=1000)
dataset = loader.load()
Record format:
  • text: Question + answer choices
  • metadata.id: Question identifier
  • metadata.question: Question text
  • metadata.answerKey: Correct answer
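
Because the answer key travels in metadata, it can be read back after loading. A minimal sketch over the fields listed above:
for record in dataset.records[:3]:
    print(record.metadata["id"], record.metadata["answerKey"])
    print(record.text[:80])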

PopQA

Dataset ID: popqa
HuggingFace: akariasai/PopQA
Structure: Entity-centric questions from Wikipedia
loader = DataloaderCatalog.create("popqa", split="test", limit=500)
dataset = loader.load()
Record format:
  • text: Wikipedia passage
  • metadata.id: Passage identifier
  • metadata.entity: Entity mention
  • metadata.question: Associated question

FActScore

Dataset ID: factscore
HuggingFace: dskar/FActScore
Structure: Factuality-focused QA pairs
loader = DataloaderCatalog.create("factscore", split="test", limit=500)
dataset = loader.load()
Record format:
  • text: Factual statement
  • metadata.id: Statement identifier
  • metadata.topic: Topic/category

Earnings Calls

Dataset ID: earnings_calls
HuggingFace: lamini/earnings-calls-qa
Structure: Financial QA from earnings call transcripts
loader = DataloaderCatalog.create("earnings_calls", split="train", limit=300)
dataset = loader.load()
Record format:
  • text: Transcript excerpt
  • metadata.id: Excerpt identifier
  • metadata.company: Company name
  • metadata.quarter: Reporting quarter

Base loader interface

All loaders implement the BaseDatasetLoader abstract class:
class BaseDatasetLoader(ABC):
    def __init__(
        self,
        dataset_name: str,
        split: str,
        limit: int | None = None,
        streaming: bool = True,
    ) -> None:
        """Initialize the loader with dataset configuration."""

    @property
    @abstractmethod
    def dataset_type(self) -> DatasetType:
        """Return the supported dataset type identifier."""

    @abstractmethod
    def _load_dataset_iterable(self) -> Iterable[Mapping[str, Any]]:
        """Return the raw dataset rows as an iterable."""

    @abstractmethod
    def _parse_row(self, row: Mapping[str, Any]) -> list[DatasetRecord]:
        """Parse a dataset row into normalized records."""

    def load(self) -> LoadedDataset:
        """Load the dataset and return normalized records."""

Document conversion

The DocumentConverter class provides framework-specific conversion:

Haystack conversion

from vectordb.dataloaders.converters import DocumentConverter
from haystack import Document

items = [{"text": "content", "metadata": {"id": "1"}}]
haystack_docs = DocumentConverter.to_haystack(items)

# Result: List[Document]
# Document(content="content", meta={"id": "1"})

LangChain conversion

from vectordb.dataloaders.converters import DocumentConverter
from langchain_core.documents import Document

items = [{"text": "content", "metadata": {"id": "1"}}]
langchain_docs = DocumentConverter.to_langchain(items)

# Result: List[Document]
# Document(page_content="content", metadata={"id": "1"})

Configuration

Dataloaders integrate with YAML configuration:
dataloader:
  dataset: "triviaqa"    # Dataset identifier
  split: "test"          # Dataset split
  limit: 500             # Optional record limit
Load configuration and create loader:
from vectordb.utils.config import load_config
from vectordb.dataloaders import DataloaderCatalog

config = load_config("config.yaml")
dl_config = config["dataloader"]

loader = DataloaderCatalog.create(
    name=dl_config["dataset"],
    split=dl_config["split"],
    limit=dl_config.get("limit")
)

Streaming mode

By default, loaders use streaming to handle large datasets efficiently:
loader = DataloaderCatalog.create(
    name="triviaqa",
    split="test",
    limit=None  # No limit
)

# Streams dataset without loading all records into memory
dataset = loader.load()
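
Streaming applies while the raw dataset rows are read; once records are available, converting and indexing them in batches keeps downstream memory use modest. A generic sketch in which the batch size and the indexing step are placeholders:
from vectordb.dataloaders.converters import DocumentConverter, records_to_items

BATCH_SIZE = 100  # illustrative batch size
records = dataset.records
for start in range(0, len(records), BATCH_SIZE):
    batch = records[start:start + BATCH_SIZE]
    docs = DocumentConverter.to_haystack(records_to_items(batch))
    # hand `docs` to your document store / index here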

Custom loaders

Implement custom loaders by extending BaseDatasetLoader:
from vectordb.dataloaders.base import BaseDatasetLoader
from vectordb.dataloaders.types import DatasetRecord, DatasetType

class CustomLoader(BaseDatasetLoader):
    @property
    def dataset_type(self) -> DatasetType:
        return "custom"

    def _load_dataset_iterable(self):
        # Load raw dataset rows
        from datasets import load_dataset
        dataset = load_dataset(
            self.dataset_name,
            split=self.split,
            streaming=self.streaming
        )
        return dataset

    def _parse_row(self, row):
        # Parse row into DatasetRecord objects
        return [
            DatasetRecord(
                text=row["content"],
                metadata={"id": row["id"]}
            )
        ]
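
A custom loader is used like any built-in one. The constructor arguments below follow the BaseDatasetLoader signature shown earlier; the dataset name is hypothetical:
loader = CustomLoader(
    dataset_name="my-org/my-dataset",  # hypothetical HuggingFace dataset id
    split="train",
    limit=200,
)
dataset = loader.load()
print(f"Loaded {len(dataset.records)} custom records")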

Error handling

Dataloaders raise specific exceptions for different error conditions:
from vectordb.dataloaders.types import (
    UnsupportedDatasetError,
    DatasetLoadError,
    DatasetValidationError
)

try:
    loader = DataloaderCatalog.create("invalid_dataset")
except UnsupportedDatasetError as e:
    print(f"Dataset not supported: {e}")

try:
    dataset = loader.load()
except DatasetLoadError as e:
    print(f"Failed to load dataset: {e}")
except DatasetValidationError as e:
    print(f"Dataset validation failed: {e}")

Best practices

Use the catalog

Always use DataloaderCatalog.create() rather than instantiating loader classes directly; the factory keeps loader construction consistent across datasets

Set limits during development

Use the limit= parameter during prototyping to avoid loading full datasets

Convert once

Convert to framework documents once during indexing, not repeatedly during queries

Handle exceptions

Catch DatasetLoadError and DatasetValidationError for robust pipelines
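
Put together, a defensive loading step might look like the following sketch, using only the APIs shown above:
from vectordb.dataloaders import DataloaderCatalog
from vectordb.dataloaders.converters import DocumentConverter, records_to_items
from vectordb.dataloaders.types import DatasetLoadError, DatasetValidationError

# Small limit while prototyping; raise or remove it for full benchmark runs
loader = DataloaderCatalog.create(name="arc", split="test", limit=100)

try:
    dataset = loader.load()
except (DatasetLoadError, DatasetValidationError) as exc:
    raise SystemExit(f"Could not prepare dataset: {exc}")

# Convert once at indexing time and reuse the documents afterwards
documents = DocumentConverter.to_langchain(records_to_items(dataset.records))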
