Overview
The loader.py module handles loading Hadith PDFs, processing them into chunks, generating embeddings, and initializing the Chroma vector database.
Functions
load_and_prepare_data()
Loads PDF documents from the data/ directory, splits them into chunks, generates embeddings, and creates a Chroma vector store. This function is decorated with @st.cache_resource to ensure it only runs once per Streamlit session.
This function takes no parameters.
Returns
- The initialized Chroma vector store containing all document embeddings and metadata.
- The HuggingFace embeddings model (sentence-transformers/all-MiniLM-L6-v2) used for vectorization.
Processing Steps
- PDF Loading: Loads all PDFs from the data/ directory using PyPDFDirectoryLoader
- Metadata Extraction: Extracts source names from file paths and updates document metadata
- Text Splitting: Splits documents into chunks using regex patterns to identify hadith boundaries:
  - Pattern: Chapter \d+: or Book \d+, Number \d+:
  - Chunk size: 1000 characters
  - Chunk overlap: 0
- Metadata Enhancement: Extracts hadith numbers from text and adds them to chunk metadata
- Embedding Generation: Creates embeddings using sentence-transformers/all-MiniLM-L6-v2
- Vector Store Creation: Stores embeddings in the Chroma database
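The splitting and metadata-enhancement steps above can be sketched with plain re. Note this is an illustrative sketch: the pattern name, sample text, and metadata handling below are not taken from loader.py.

```python
import re

# Boundary patterns documented above, combined into one capturing group so
# re.split keeps the boundary markers in the output.
HADITH_BOUNDARY = re.compile(r"(Chapter \d+:|Book \d+, Number \d+:)")

sample = (
    "Book 1, Number 1: Actions are judged by intentions. "
    "Chapter 2: Concerning faith and its branches."
)

# Split on boundaries, dropping empty fragments and surrounding whitespace.
parts = [p.strip() for p in HADITH_BOUNDARY.split(sample) if p.strip()]
# parts[0] == "Book 1, Number 1:"; parts[2] == "Chapter 2:"

# Metadata enhancement: extract the hadith number from a boundary marker.
match = re.search(r"Number (\d+):", parts[0])
hadith_number = match.group(1) if match else None  # "1"
```

Because the boundary group is capturing, the markers themselves survive the split, so each chunk can be paired with the heading that introduced it.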
Side Effects
- Persists the Chroma vector store to the configured directory on disk.
- Caches the returned vector store and embeddings model for the Streamlit session via @st.cache_resource.
Function Signature
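The source gives no signature here; based on the description above (no parameters, cached per session, returns the vector store and the embeddings model), a plausible sketch is as follows. The decorator appears only in a comment since streamlit is not imported in this fragment, and the body is elided.

```python
# @st.cache_resource  # applied in loader.py; runs once per Streamlit session
def load_and_prepare_data():
    """Load PDFs from data/, split into hadith chunks, embed, and build Chroma.

    Takes no parameters; returns the Chroma vector store and the
    HuggingFace embeddings model.
    """
    ...
```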
Example Usage
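This section was empty in the source; a hypothetical usage sketch, assuming loader.py exposes load_and_prepare_data and that it returns the vector store and the embeddings model in that order:

```python
from loader import load_and_prepare_data  # assumed import path

# Build the vector store, or reuse the session-cached one.
vectorstore, embeddings = load_and_prepare_data()

# Retrieve the three chunks most similar to a query.
for doc in vectorstore.similarity_search("charity", k=3):
    print(doc.metadata.get("source"), "-", doc.page_content[:80])
```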
Dependencies
- langchain_community.document_loaders.PyPDFDirectoryLoader
- langchain_text_splitters.RecursiveCharacterTextSplitter
- langchain_huggingface.HuggingFaceEmbeddings
- langchain_chroma.Chroma
- streamlit (for caching)
- re (for regex pattern matching)
Configuration
- Data directory (data/): directory containing Hadith PDF files
- Persist directory: directory where the Chroma vector store is persisted
- Chunk size (1000): maximum size of each text chunk in characters
- Chunk overlap (0): number of overlapping characters between chunks
- Embedding model (sentence-transformers/all-MiniLM-L6-v2): HuggingFace embedding model identifier
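The setting names were lost in this copy of the document; the constants below are a hypothetical sketch (all identifiers and the persist path are assumptions; only the values documented above are known):

```python
DATA_PATH = "data/"              # directory containing Hadith PDF files
PERSIST_DIRECTORY = "chroma_db"  # assumed path; the real persist directory is not named here
CHUNK_SIZE = 1000                # maximum size of each text chunk in characters
CHUNK_OVERLAP = 0                # overlapping characters between chunks
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # HuggingFace model id
```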
