
Overview

The loader.py module handles loading Hadith PDFs, processing them into chunks, generating embeddings, and initializing the Chroma vector database.

Functions

load_and_prepare_data()

Loads PDF documents from the data/ directory, splits them into chunks, generates embeddings, and creates a Chroma vector store. The function is decorated with @st.cache_resource, so its return value is computed once and then reused across Streamlit reruns and sessions for as long as the server process keeps the cache.

Parameters

This function takes no parameters.

Returns

  • db (Chroma): The initialized Chroma vector store containing all document embeddings and metadata.
  • embeddings (HuggingFaceEmbeddings): The HuggingFace embeddings model (sentence-transformers/all-MiniLM-L6-v2) used for vectorization.

Processing Steps

  1. PDF Loading: Loads all PDFs from the data/ directory using PyPDFDirectoryLoader
  2. Metadata Extraction: Extracts source names from file paths and updates document metadata
  3. Text Splitting: Splits documents into chunks using regex patterns to identify hadith boundaries:
    • Pattern: Chapter \d+: or Book \d+, Number \d+:
    • Chunk size: 1000 characters
    • Chunk overlap: 0
  4. Metadata Enhancement: Extracts hadith numbers from the chunk text and adds them to the chunk metadata
  5. Embedding Generation: Creates embeddings using sentence-transformers/all-MiniLM-L6-v2
  6. Vector Store Creation: Stores embeddings in Chroma database
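
Steps 2 to 4 can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the function names, the metadata keys, and the use of the file stem as the source name are all assumptions, and the sketch does not enforce the 1000-character chunk size. Only the two boundary patterns come from the list above.

```python
import re
from pathlib import Path

# Boundary patterns quoted in step 3. The lookahead splits *before* each
# marker, so the marker stays at the start of the chunk it introduces.
BOUNDARY = re.compile(r"(?=Chapter \d+:|Book \d+, Number \d+:)")

# Hypothetical pattern for step 4; the real module may extract numbers differently.
HADITH_REF = re.compile(r"Book (\d+), Number (\d+):")


def source_name(file_path: str) -> str:
    """Step 2 (assumed behaviour): use the file name without extension."""
    return Path(file_path).stem


def split_on_boundaries(text: str) -> list[str]:
    """Step 3: split text so that each piece starts at a boundary marker."""
    pieces = BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


def hadith_metadata(chunk: str) -> dict:
    """Step 4: return book and hadith numbers, or an empty dict if none found."""
    match = HADITH_REF.search(chunk)
    if match is None:
        return {}
    return {"book": int(match.group(1)), "hadith_number": int(match.group(2))}
```

Splitting on a lookahead rather than on the marker itself keeps the "Book N, Number M:" prefix inside each chunk, which is what makes the later number extraction possible.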

Side Effects

This function creates a database/chroma_db/ directory in the project root to persist the vector store.
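
Because the directory is written to disk, it can be useful to check whether a persisted store is already present before triggering a rebuild. The helpers below are a hypothetical sketch (neither function exists in loader.py); they assume only the database/chroma_db/ path named above.

```python
from pathlib import Path

# Relative location of the persisted store, as documented above.
PERSIST_SUBDIR = Path("database") / "chroma_db"


def persisted_store_path(project_root: str) -> Path:
    """Return the expected location of the Chroma directory (hypothetical helper)."""
    return Path(project_root) / PERSIST_SUBDIR


def store_exists(project_root: str) -> bool:
    """True if the directory exists and contains at least one entry."""
    path = persisted_store_path(project_root)
    return path.is_dir() and any(path.iterdir())
```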

Function Signature

import streamlit as st

@st.cache_resource
def load_and_prepare_data():
    # ... implementation
    return db, embeddings

Example Usage

from loader import load_and_prepare_data

# Initialize vector store and embeddings
db, embeddings = load_and_prepare_data()

# Use the database for retrieval
retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 10}
)

# Search for relevant documents
results = retriever.invoke("What does Islam say about prayer?")

Dependencies

  • langchain_community.document_loaders.PyPDFDirectoryLoader
  • langchain_text_splitters.RecursiveCharacterTextSplitter
  • langchain_huggingface.HuggingFaceEmbeddings
  • langchain_chroma.Chroma
  • streamlit (for caching)
  • re (for regex pattern matching)

Configuration

  • folder_path (string, default: "data/"): Directory containing Hadith PDF files
  • persist_directory (string, default: "database/chroma_db"): Directory where the Chroma vector store is persisted
  • chunk_size (integer, default: 1000): Maximum size of each text chunk in characters
  • chunk_overlap (integer, default: 0): Number of overlapping characters between chunks
  • model_name (string, default: "sentence-transformers/all-MiniLM-L6-v2"): HuggingFace embedding model identifier
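
For reference, the defaults above can be gathered in one place. The LoaderConfig class below is a hypothetical sketch; the source does not say how loader.py stores these values, and they may simply be literals inside the function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoaderConfig:
    """Hypothetical container for the documented defaults (not part of loader.py)."""

    folder_path: str = "data/"
    persist_directory: str = "database/chroma_db"
    chunk_size: int = 1000      # maximum characters per chunk
    chunk_overlap: int = 0      # no characters shared between adjacent chunks
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"


# Override a single value while keeping the other defaults.
custom = LoaderConfig(chunk_size=500)
```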
