Dataloader APIs for loading benchmark datasets and converting them to Haystack or LangChain document formats.

DataloaderCatalog

Factory class for creating dataset loaders by name.

Methods

create

Create a dataset loader instance.
DataloaderCatalog.create(
    name: Literal["triviaqa", "arc", "popqa", "factscore", "earnings_calls"],
    split: str = "test",
    limit: int | None = None,
    dataset_id: str | None = None
) -> BaseDatasetLoader
Parameters

  • name (Literal, required): Dataset type name. Options: "triviaqa", "arc", "popqa", "factscore", "earnings_calls"
  • split (str, default "test"): Dataset split to load (e.g., "train", "test", "validation")
  • limit (int, optional): Maximum number of records to load
  • dataset_id (str, optional): HuggingFace dataset ID override

Returns

  • loader (BaseDatasetLoader): Configured dataset loader instance, ready to load data
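For example, a loader pointed at a custom dataset mirror might be created like this (a minimal sketch; the dataset ID shown is a hypothetical placeholder, not a real default):

from vectordb.dataloaders import DataloaderCatalog

# Load up to 500 validation records from a custom HuggingFace dataset
# ("my-org/popqa-mirror" is an illustrative placeholder)
loader = DataloaderCatalog.create(
    name="popqa",
    split="validation",
    limit=500,
    dataset_id="my-org/popqa-mirror",
)
dataset = loader.load()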

supported_datasets

Return list of supported dataset identifiers.
DataloaderCatalog.supported_datasets() -> tuple[DatasetType, ...]
Returns

  • datasets (tuple[DatasetType, ...]): Tuple of supported dataset type names
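This makes it easy to validate user input before constructing a loader, for example (a minimal sketch, assuming dataset type names compare as plain strings):

from vectordb.dataloaders import DataloaderCatalog

requested = "arc"

# Reject unknown dataset names before constructing a loader
if requested not in DataloaderCatalog.supported_datasets():
    raise ValueError(f"Unsupported dataset: {requested}")
loader = DataloaderCatalog.create(name=requested)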

LoadedDataset

Wrapper for normalized dataset records with conversion methods.

Constructor

LoadedDataset(
    dataset_type: DatasetType,
    records: list[DatasetRecord]
)
Parameters

  • dataset_type (DatasetType, required): Identifier of the dataset (e.g., "triviaqa", "arc")
  • records (list[DatasetRecord], required): Normalized dataset records
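In practice you rarely construct a LoadedDataset by hand; one usually comes back from a loader's load() call. A minimal sketch (the dataset_type attribute access assumes the constructor argument is exposed under the same name):

from vectordb.dataloaders import DataloaderCatalog

dataset = DataloaderCatalog.create(name="arc", limit=10).load()

# Assumes dataset_type is exposed as an attribute of the same name
print(dataset.dataset_type)    # "arc"
print(len(dataset.records()))  # at most 10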

Methods

records

Return normalized dataset records.
records() -> list[DatasetRecord]
Returns

  • records (list[DatasetRecord]): List of normalized dataset records

to_dict_items

Convert normalized records to dictionary items.
to_dict_items() -> list[dict[str, Any]]
Returns

  • items (list[dict[str, Any]]): List of dictionary representations of the records
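The dictionary form is convenient for serialization, for instance (a sketch, assuming the field values are JSON-serializable):

import json

# Persist the normalized records as JSON Lines for offline inspection
with open("records.jsonl", "w") as f:
    for item in dataset.to_dict_items():
        f.write(json.dumps(item) + "\n")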

to_haystack

Convert records to Haystack documents.
to_haystack() -> list[HaystackDocument]
Returns

  • documents (list[HaystackDocument]): List of Haystack Document objects ready for indexing
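The converted documents can be written straight into a Haystack document store, for example (a sketch assuming Haystack 2.x):

from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index the converted documents into an in-memory store
store = InMemoryDocumentStore()
store.write_documents(dataset.to_haystack())
print(store.count_documents())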

to_langchain

Convert records to LangChain documents.
to_langchain() -> list[LangChainDocument]
Returns

  • documents (list[LangChainDocument]): List of LangChain Document objects ready for indexing
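Likewise, the LangChain documents plug into standard LangChain tooling, such as chunking before indexing (a sketch assuming the langchain-text-splitters package):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk the converted documents before embedding and indexing
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(dataset.to_langchain())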

evaluation_queries

Extract evaluation queries from records.
evaluation_queries(limit: int | None = None) -> list[EvaluationQuery]
Parameters

  • limit (int, optional): Limit applied after deduplication

Returns

  • queries (list[EvaluationQuery]): List of evaluation queries with ground-truth answers for retrieval testing
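Iterating over the queries is straightforward; note that the attribute names below are assumptions for illustration and should be checked against the actual EvaluationQuery definition:

# Attribute names on EvaluationQuery are hypothetical; inspect the
# class for the real field names
for q in dataset.evaluation_queries(limit=5):
    print(q.query, q.expected_answers)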

BaseDatasetLoader

Base class for dataset loaders. All specific dataset loaders inherit from this class.

Supported Datasets

  • TriviaQALoader: TriviaQA question-answering dataset
  • ARCLoader: AI2 Reasoning Challenge (ARC) dataset
  • PopQALoader: PopQA popularity-based question-answering dataset
  • FactScoreLoader: FactScore factual consistency dataset
  • EarningsCallsLoader: Earnings calls transcripts dataset
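All of these are obtained through the catalog rather than instantiated directly. A quick smoke test over every registered loader might look like this (a minimal sketch; the small limit keeps downloads cheap):

from vectordb.dataloaders import DataloaderCatalog

# Instantiate each registered loader with a tiny record budget
for name in DataloaderCatalog.supported_datasets():
    loader = DataloaderCatalog.create(name=name, limit=5)
    print(name, type(loader).__name__)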

Usage Examples

Load a dataset

from vectordb.dataloaders import DataloaderCatalog

# Create loader
loader = DataloaderCatalog.create(
    name="triviaqa",
    split="test",
    limit=100
)

# Load dataset
dataset = loader.load()

# Convert to Haystack documents
haystack_docs = dataset.to_haystack()

# Convert to LangChain documents
langchain_docs = dataset.to_langchain()

# Get evaluation queries
queries = dataset.evaluation_queries(limit=50)

List supported datasets

from vectordb.dataloaders import DataloaderCatalog

# Get all supported datasets
datasets = DataloaderCatalog.supported_datasets()
print(datasets)
# ('triviaqa', 'arc', 'popqa', 'factscore', 'earnings_calls')
